ABSTRACT



Multiple instances of operating systems execute coopera- 
tively in a single multiprocessor computer wherein all 
processors and resources are electrically connected together.
The single physical machine with multiple physical proces- 
sors and resources is subdivided by software into multiple 
partitions, each with the ability to run a distinct copy, or 
instance, of an operating system. At different times, different 
operating system instances may be loaded on a given 
partition. Resources, such as CPUs and memory, can be 
dynamically assigned to different partitions and used by 
instances of operating systems running within the machine 
by modifying the configuration. The partitions themselves 
can also be changed without rebooting the system by modi- 
fying the configuration tree. CPUs, in particular, may be 
migrated, or reassigned, from one operating system instance 
to another, allowing different loads in the system to be 
accommodated. The migrations involve storing the process- 
ing context of a migrating processor prior to its reassignment 
and, after reassignment, loading any previous processing 
context that it may have stored from a previous execution 
with the partition to which it is reassigned. Hardware flags 
are also provided which include an identification of which 
CPU belongs to which partition, and an availability indicator 
for each CPU, which indicates whether a given CPU is
available for SMP operation. 

20 Claims, 13 Drawing Sheets 
DYNAMICALLY ASSIGNING CPUS TO DIFFERENT PARTITIONS EACH HAVING
AN OPERATION SYSTEM INSTANCE IN A SHARED MEMORY SPACE

FIELD OF THE INVENTION

This invention relates to multiprocessor computer architectures
in which processors and other computer hardware resources are
grouped in partitions, each of which has an operating system
instance and, more specifically, to methods and apparatus for
migrating computer hardware resources from one partition to
another without rebooting the computer system.

BACKGROUND OF THE INVENTION

The efficient operation of many applications in present
computing environments depends upon fast, powerful and flexible
computing systems. The configuration and design of such systems
has become very complicated when such systems are to be used in
an "enterprise" commercial environment where there may be many
separate departments, many different problem types and
continually changing computing needs. Users in such environments
generally want to be able to quickly and easily change the
capacity of the system, its speed and its configuration. They
may also want to expand the system work capacity and change
configurations to achieve better utilization of resources
without stopping execution of application programs on the
system. In addition, they may want to be able to configure the
system in order to maximize resource availability so that each
application will have an optimum computing configuration.

Traditionally, computing speed has been addressed by using a
"shared nothing" computing architecture where data, business
logic, and graphic user interfaces are distinct tiers and have
specific computing resources dedicated to each tier. Initially,
a single central processing unit was used and the power and
speed of such a computing system was increased by increasing the
clock rate of the single central processing unit. More recently,
computing systems have been developed which use several
processors working as a team instead of one massive processor
working alone. In this manner, a complex application can be
distributed among many processors instead of waiting to be
executed by a single processor. Such systems typically consist
of several central processing units (CPUs) which are controlled
by a single operating system. In a variant of a multiple
processor system called "symmetric multiprocessing" or SMP, the
applications are distributed equally across all processors. The
processors also share memory. In another variant called
"asymmetric multiprocessing" or AMP, one processor acts as a
"master" and all of the other processors act as "slaves."
Therefore, all operations, including the operating system, must
pass through the master before being passed onto the slave
processors. These multiprocessing architectures have the
advantage that performance can be increased by adding additional
processors, but suffer from the disadvantage that the software
running on such systems must be carefully written to take
advantage of the multiple processors and it is difficult to
scale the software as the number of processors increases.
Current commercial workloads do not scale well beyond 8-24 CPUs
as a single SMP system, the exact number depending upon
platform, operating system and application mix.

For increased performance, another typical answer has been to
dedicate computer resources (machines) to an application in
order to optimally tune the machine resources to the
application. However, this approach has not been adopted by the
majority of users because most sites have many applications and
separate databases developed by different vendors. Therefore, it
is difficult, and expensive, to dedicate resources among all of
the applications, especially in environments where the
application mix is constantly changing.

Alternatively, a computing system can be partitioned with
hardware to make a subset of the resources on a computer
available to a specific application. This approach avoids
dedicating the resources permanently since the partitions can be
changed, but still leaves issues concerning performance
improvements by means of load balancing of resources among
partitions and resource availability.

The availability and maintainability issues were addressed by a
"shared everything" model in which a large centralized robust
server that contains most of the resources is networked with and
services many small, uncomplicated client network computers.
Alternatively, "clusters" are used in which each system or
"node" has its own memory and is controlled by its own operating
system. The systems interact by sharing disks and passing
messages among themselves via some type of communications
network. A cluster system has the advantage that additional
systems can easily be added to a cluster. However, networks and
clusters suffer from a lack of shared memory and from limited
interconnect bandwidth which places limitations on performance.

In many enterprise computing environments, it is clear that the
two separate computing models must be simultaneously
accommodated and each model optimized. Several prior art
approaches have been used to attempt this accommodation. For
example, a design called a "virtual machine" developed and
marketed by International Business Machines Corporation, Armonk,
N.Y., uses a single physical machine, with one or more physical
processors, in combination with software which simulates
multiple virtual machines. Each of those virtual machines has,
in principle, access to all the physical resources of the
underlying real computer. The assignment of resources to each
virtual machine is controlled by a program called a
"hypervisor". There is only one hypervisor in the system and it
is responsible for all the physical resources. Consequently, the
hypervisor, not the other operating systems, deals with the
allocation of physical hardware. The hypervisor intercepts
requests for resources from the other operating systems and
deals with the requests in a globally-correct way.

The VM architecture supports the concept of a "logical
partition" or LPAR. Each LPAR contains some of the available
physical CPUs and resources which are logically assigned to the
partition. The same resources can be assigned to more than one
partition. LPARs are set up by an administrator statically, but
can respond to changes in load dynamically, and without
rebooting, in several ways. For example, if two logical
partitions, each containing ten CPUs, are shared on a physical
system containing ten physical CPUs, and, if the logical ten CPU
partitions have complementary peak loads, each partition can
take over the entire physical ten CPU system as the workload
shifts without a re-boot or operator intervention.

In addition, the CPUs logically assigned to each partition can
be turned "on" and "off" dynamically via normal operating system
operator commands without re-boot. The only limitation is that
the number of CPUs active at system initialization is the
maximum number of CPUs that can be turned "on" in any partition.
Finally, in cases where the aggregate workload demand of all
partitions is more than can be delivered by the physical system,
LPAR weights can be used to define how much of the total CPU
resources is given to each partition. These weights can be
changed by operators on-the-fly with no disruption.

Another prior art system is called a "Parallel Sysplex" and is
also marketed and developed by the International Business
Machines Corporation. This architecture consists of a set of
computers that are clustered via a hardware entity called a
"coupling facility" attached to each CPU. The coupling
facilities on each node are connected via a fiber-optic link and
each node operates as a traditional SMP machine, with a maximum
of 10 CPUs. Certain CPU instructions directly invoke the
coupling facility. For example, a node registers a data
structure with the coupling facility, then the coupling facility
takes care of keeping the data structures coherent within the
local memory of each node.

The Enterprise 10000 Unix server developed and marketed by Sun
Microsystems, Mountain View, Calif., uses a partitioning
arrangement called "Dynamic System Domains" to logically divide
the resources of a single physical server into multiple
partitions, or domains, each of which operates as a stand-alone
server. Each of the partitions has CPUs, memory and I/O
hardware. Dynamic reconfiguration allows a system administrator
to create, resize, or delete domains on the fly and without
rebooting. Every domain remains logically isolated from any
other domain in the system, isolating it completely from any
software error or CPU, memory, or I/O error generated by any
other domain. There is no sharing of resources between any of
the domains.

The Hive Project conducted at Stanford University uses an
architecture which is structured as a set of cells. When the
system boots, each cell is assigned a range of nodes that it
owns throughout execution. Each cell manages the processors,
memory and I/O devices on those nodes as if it were an
independent operating system. The cells cooperate to present the
illusion of a single system to user-level processes.

Hive cells are not responsible for deciding how to divide their
resources between local and remote requests. Each cell is
responsible only for maintaining its internal resources and for
optimizing performance within the resources it has been
allocated. Global resource allocation is carried out by a
user-level process called "wax." The Hive system attempts to
prevent data corruption by using certain fault containment
boundaries between the cells. In order to implement the tight
sharing expected from a multiprocessor system despite the fault
containment boundaries between cells, resource sharing is
implemented through the cooperation of the various cell kernels,
but the policy is implemented outside the kernels in the wax
process. Both memory and processors can be shared.

A system called "Cellular IRIX" developed and marketed by
Silicon Graphics Inc., Mountain View, Calif., supports modular
computing by extending traditional symmetric multiprocessing
systems. The Cellular IRIX architecture distributes global
kernel text and data into optimized SMP-sized chunks or "cells".
Cells represent a control domain consisting of one or more
machine modules, where each module consists of processors,
memory, and I/O. Applications running on these cells rely
extensively on a full set of local operating system services,
including local copies of operating system text and kernel data
structures. Only one instance of the operating system exists on
the entire system. Inter-cell coordination allows application
images to directly and transparently utilize processing, memory
and I/O resources from other cells without incurring the
overhead of data copies or extra context switches.

Another existing architecture called NUMA-Q developed and
marketed by Sequent Computer Systems, Inc., Beaverton, Oreg.,
uses "quads", or a group of four processors per portion of
memory, as the basic building block for NUMA-Q SMP nodes. Adding
I/O to each quad further improves performance. Therefore, the
NUMA-Q architecture not only distributes physical memory but
puts a predetermined number of processors and PCI slots next to
each part. The memory in each quad is not local memory in the
traditional sense. Rather, it is one third of the physical
memory address space and has a specific address range. The
address map is divided evenly over memory, with each quad
containing a contiguous portion of address space. Only one copy
of the operating system is running and, as in any SMP system, it
resides in memory and runs processes without distinction and
simultaneously on one or more processors.

Accordingly, while many attempts have been made at providing a
flexible computer system having maximum resource availability
and scalability, existing systems each have significant
shortcomings. Therefore, it would be desirable to have a new
computer system design which provides improved flexibility,
resource availability and scalability. In particular, it would
be useful to have a computer system with multiple processors
that could be shared between different operating systems running
simultaneously in the system. That is, when the operational
loads of the different partitions change, it would be beneficial
if exclusive control of one of the processors could be
transferred, i.e. migrated, from a first partition to a busier
partition. In such a case, multiple operating systems, each
running different applications, could have dynamic sharing of
resources. Therefore, it would be desirable to have a new
computer system design which provides improved flexibility, and
resource migration capabilities.

SUMMARY OF THE INVENTION

In accordance with the principles of the present invention,
multiple instances of operating systems execute cooperatively in
a single multiprocessor computer wherein all processors and
resources are electrically connected together. The single
physical machine with multiple physical processors and resources
is adaptively subdivided by software into multiple partitions,
each with the ability to run a distinct copy, or instance, of an
operating system. Each of the partitions has access to its own
physical resources plus resources designated as shared. In
accordance with one embodiment, the partitioning of resources is
performed by assigning resources within a configuration.

More particularly, software logically, and adaptively,
partitions CPUs, memory, and I/O ports by assigning them
together. An instance of an operating system may then be loaded
on a partition. At different times, different operating system
instances may be loaded on a given partition. This partitioning,
which a system manager directs, is a software function; no
hardware boundaries are required. Each individual instance has
the resources it needs to execute independently. Resources, such
as CPUs and memory, can be dynamically assigned to different
partitions and used by instances of operating systems running
within the machine by modifying the configuration. The
partitions themselves can also be changed without rebooting the
system by modifying the configuration tree. The resulting
adaptively-partitioned, multi-processing (APMP) system exhibits
both scalability and high performance.
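
By way of illustration only, the software partitioning described
above can be pictured as a table that maps each resource to the
partition that currently owns it. The following Python sketch is
not part of the original disclosure; the class Configuration, its
methods, and the resource and partition names are hypothetical,
chosen only to show that reassignment amounts to an edit of the
configuration rather than a hardware change or a reboot.

```python
class Configuration:
    """Hypothetical ownership table: resource name -> owning partition (or None)."""

    def __init__(self, resources):
        # Every resource starts out unassigned.
        self.owner = {name: None for name in resources}

    def assign(self, resource, partition):
        # Assigning or reassigning a resource only edits the table.
        if resource not in self.owner:
            raise KeyError(f"unknown resource: {resource}")
        self.owner[resource] = partition

    def resources_of(self, partition):
        # List the resources currently owned by one partition.
        return sorted(r for r, p in self.owner.items() if p == partition)


config = Configuration(["cpu0", "cpu1", "cpu2", "mem0", "mem1"])
config.assign("cpu0", "partition_a")
config.assign("cpu1", "partition_a")
config.assign("mem0", "partition_a")
config.assign("cpu2", "partition_b")
config.assign("mem1", "partition_b")

# Shift one CPU to the busier partition by modifying the configuration.
config.assign("cpu1", "partition_b")
print(config.resources_of("partition_a"))
print(config.resources_of("partition_b"))
```

In this simplified model the running state of the operating system
instances is not represented at all; only ownership changes.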

The invention includes a means for moving a processor 
from a first partition to a second partition. Such a movement 
requires the execution of an instruction by the moving 
processor, so that its acquiescence to the move (and that of 
the operating system instance on which it is running) is
ensured. When a move is initiated, the migrating processor
stores its current hardware state, and loads a hardware state
that it held during a previous execution within the second 
partition. Thus, the processor resumes operation in the 
second partition from where it left off previously. If there is 
no stored hardware state in the partition to where the 
processor is migrating, it is placed in an initialized state. 
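
The store-and-load behavior just described can be sketched, under
stated assumptions, as follows. This Python fragment is
illustrative only and is not taken from the disclosure: the class
Processor, the dictionary used to hold saved states, and the
contents of INITIALIZED_STATE are hypothetical stand-ins for the
hardware state and the per-partition storage the text refers to.

```python
# Hypothetical placeholder for the "initialized state" of a processor.
INITIALIZED_STATE = {"program_counter": 0}


class Processor:
    def __init__(self, cpu_id, partition):
        self.cpu_id = cpu_id
        self.partition = partition
        self.state = dict(INITIALIZED_STATE)
        # Assumption: one saved state per partition this CPU has run in.
        self.saved = {}

    def migrate(self, target):
        # The migrating processor itself performs the move:
        # first it stores its current state under the partition it leaves.
        self.saved[self.partition] = dict(self.state)
        # Then it loads the state held during a previous execution in the
        # target partition, or an initialized state if none was stored.
        self.state = dict(self.saved.get(target, INITIALIZED_STATE))
        self.partition = target


cpu = Processor(cpu_id=3, partition="A")
cpu.state["program_counter"] = 100

cpu.migrate("B")                      # first visit to B: initialized state
first_visit = dict(cpu.state)
cpu.state["program_counter"] = 200

cpu.migrate("A")                      # resumes where it left off in A
back_in_a = dict(cpu.state)

cpu.migrate("B")                      # resumes where it left off in B
back_in_b = dict(cpu.state)
```

The sketch keeps the saved states inside the processor object purely
for brevity; where the states are actually held is not specified in
this passage.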

The present system has interaction between the partitions 
that allows a processor to migrate from one partition to the
other without requiring a reboot of the entire system. Soft- 
ware running on its current partition, or a primary processor 
in its partition, can provide the processor to be moved with 
a request that it initiate a migration operation. Such a 
migration may occur with or without interruption of the 
operating system in which it resides. That is, the processor 
may simply be quiesced and reassigned while the rest of the 
system continues to operate, or the resources in its partition 
may be halted and a console program invoked to coordinate
the move. 

To keep track of the processors in the system, each 
partition has a set of hardware flags which includes flags that
identify the partition with which each of the processors is 
associated, respectively. The hardware flags also indicate 
when a given processor is available to be used in SMP 
operation. By updating these flags each time a processor is 
migrated, the present status of each processor is known and 
retained within the hardware flags for use in any necessary 
reboots of the system. 
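
As a rough illustration of how such flags might be kept current,
the following Python sketch records, for each CPU, an owning
partition and an SMP availability indicator. It is a simplified
model and not the disclosed hardware: the class HardwareFlags and
its method names are hypothetical, a single table is used in place
of one flag set per partition, and clearing the availability
indicator on migration is an assumption made for the example.

```python
class HardwareFlags:
    def __init__(self, cpu_count):
        # One entry per CPU: owning partition ID and SMP availability.
        self.partition_of = [None] * cpu_count
        self.available = [False] * cpu_count

    def record_migration(self, cpu, partition):
        # Update the ownership flag each time a processor is migrated.
        self.partition_of[cpu] = partition
        # Assumption: a newly migrated CPU is not yet available for SMP.
        self.available[cpu] = False

    def set_available(self, cpu, available=True):
        self.available[cpu] = available

    def smp_cpus(self, partition):
        # CPUs that belong to the partition and are available for SMP.
        return [cpu for cpu, owner in enumerate(self.partition_of)
                if owner == partition and self.available[cpu]]


flags = HardwareFlags(cpu_count=4)
for cpu, partition in enumerate([0, 0, 1, 1]):
    flags.record_migration(cpu, partition)
    flags.set_available(cpu)

flags.record_migration(1, 1)          # CPU 1 moves from partition 0 to 1
smp_before_join = flags.smp_cpus(1)
flags.set_available(1)                # CPU 1 becomes available for SMP
smp_after_join = flags.smp_cpus(1)
```

Because the table always reflects the latest migration, it could be
consulted after a reboot to restore each CPU to its last recorded
partition.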

BRIEF DESCRIPTION OF THE DRAWINGS 

The above and further advantages of the invention may be 
better understood by referring to the following description in 
conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic block diagram of a hardware 
platform illustrating several system building blocks. 

FIG. 2 is a schematic diagram of an APMP computer 
system constructed in accordance with the principles of the 
present invention illustrating several partitions. 

FIG. 3 is a schematic diagram of a configuration tree 
which represents hardware resource configurations and soft- 
ware configurations and their component parts with child 
and sibling pointers. 

FIG. 4 is a schematic diagram of the configuration tree 
shown in FIG. 3 and rearranged to illustrate the assignment 
of hardware to software instances by ownership pointers. 

FIG. 5 is a flowchart outlining steps in an illustrative 
routine for creating an APMP computer system in accor- 
dance with the principles of the present invention. 

FIG. 6 is a flowchart illustrating the steps in an illustrative 
routine for creating entries in an APMP system management 
database which maintains information concerning the APMP 
system and its configuration. 

FIGS. 7A and 7B, when placed together, form a flowchart
illustrating in detail the steps in an illustrative routine for 
creating an APMP computer system in accordance with the 
principles of the present invention. 

FIGS. 8A and 8B, when placed together, form a flowchart
illustrating the steps in an illustrative routine followed by an
operating system instance to join an APMP computer system
which is already created. 

FIG. 9 is a flowchart illustrating the steps in an illustrative
routine followed by a CPU which is migrating from one
partition to another under a "PAL" type migration.

FIG. 10 is a flowchart illustrating the steps in an illustra- 
tive routine followed by software in a partition to which a 
CPU is migrating. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

A computer platform constructed in accordance with the
principles of the present invention is a multi-processor
system capable of being partitioned to allow the concurrent
execution of multiple instances of operating system software.
The system does not require hardware support for the
partitioning of its memory, CPUs and I/O subsystems, but
some hardware may be used to provide additional hardware
assistance for isolating faults, and minimizing the cost of
software engineering. The following specification describes
the interfaces and data structures required to support the
inventive software architecture. The interfaces and data
structures described are not meant to imply a specific
operating system must be used, or that only a single type of
operating system will execute concurrently. Any operating
system which implements the software requirements discussed
below can participate in the inventive system operation.

System Building Blocks

The inventive software architecture operates on a hardware
platform which incorporates multiple CPUs, memory and I/O
hardware. Preferably, a modular architecture such as that
shown in FIG. 1 is used, although those skilled in the art
will understand that other architectures can also be used,
which architectures need not be modular. FIG. 1 illustrates
a computing system constructed of four basic system building
blocks (SBBs) 100-106. In the illustrative embodiment, each
building block, such as block 100, is identical and comprises
several CPUs 108-114, several memory slots (illustrated
collectively as memory 120), an I/O processor 118, and a port
116 which contains a switch (not shown) that can connect the
system to another such system. However, in other embodiments,
the building blocks need not be identical. Large multiprocessor
systems can be constructed by connecting the desired number of
system building blocks by means of their ports. Switch
technology, rather than bus technology, is employed to connect
building block components in order both to achieve improved
bandwidth and to allow for non-uniform memory architectures
(NUMA).

In accordance with the principles of the invention, the
hardware switches are arranged so that each CPU can address
all available memory and I/O ports regardless of the number of
building blocks configured as schematically illustrated by
line 122. In addition, all CPUs may communicate to any or all
other CPUs in all SBBs with conventional mechanisms, such as
inter-processor interrupts. Consequently, the CPUs and other
hardware resources can be associated solely with software.
Such a platform architecture is inherently scalable so that
large amounts of processing power, memory and I/O will be
available in a single computer.

An APMP computer system 200 constructed in accordance with
the principles of the present invention from a software view
is illustrated in FIG. 2. In this system, the hardware
components have been allocated to allow concurrent execution
of multiple operating system instances 208, 210, 212.
In a preferred embodiment, this allocation is performed by a
software program called a "console" program, which, as will
hereinafter be described in detail, is loaded into memory at
power up. Console programs are shown schematically in FIG. 2 as
programs 213, 215 and 217. The console program may be a
modification of an existing administrative program or a separate
program which interacts with an operating system to control the
operation of the preferred embodiment. The console program does
not virtualize the system resources, that is, it does not create
any software layers between the running operating systems 208,
210 and 212 and the physical hardware, such as memory and I/O
units (not shown in FIG. 2). Nor is the state of the running
operating systems 208, 210 and 212 swapped to provide access to
the same hardware. Instead, the inventive system logically
divides the hardware into partitions. It is the responsibility
of operating system instances 208, 210, and 212 to use the
resources appropriately and provide coordination of resource
allocation and sharing. The hardware platform may optionally
provide hardware assistance for the division of resources, and
may provide fault barriers to minimize the ability of an
operating system to corrupt memory, or affect devices controlled
by another operating system copy.

The execution environment for a single copy of an operating
system, such as copy 208, is called a "partition" 202, and the
executing operating system 208 in partition 202 is called
"instance" 208. Each operating system instance is capable of
booting and running independently of all other operating system
instances in the computer system, and can cooperatively take
part in sharing resources between operating system instances as
described below.

In order to run an operating system instance, a partition must
include a hardware restart parameter block (HWRPB), a copy of a
console program, some amount of memory, one or more CPUs, and at
least one I/O bus which must have a dedicated physical port for
the console. The HWRPB is a configuration block which is passed
between the console program and the operating system.

Each of console programs 213, 215 and 217 is connected to a
console port, shown as ports 214, 216 and 218, respectively.
Console ports, such as ports 214, 216 and 218, generally come in
the form of a serial line port, or attached graphics, keyboard
and mouse options. For the purposes of the inventive computer
system, the capability of supporting a dedicated graphics port
and associated input devices is not required, although a
specific operating system may require it. The base assumption is
that a serial port is sufficient for each partition. While a
separate terminal, or independent graphics console, could be
used to display information generated by each console,
preferably the serial lines 220, 222 and 224 can all be
connected to a single multiplexer 226 attached to a workstation,
PC, or LAT 228 for display of console information.

It is important to note that partitions are not synonymous with
system building blocks. For example, partition 202 may comprise
the hardware in building blocks 100 and 106 in FIG. 1 whereas
partitions 204 and 206 might comprise the hardware in building
blocks 102 and 104, respectively. Partitions may also include
part of the hardware in a building block.

Partitions can be "initialized" or "uninitialized." An
initialized partition has sufficient resources to execute an
operating system instance, has a console program image loaded,
and a primary CPU available and executing. An initialized
partition may be under control of a console program, or may be
executing an operating system instance. In an initialized state,
a partition has full ownership and control of hardware
components assigned to it and only the partition itself may
release its components.

In accordance with the principles of the invention, resources
can be reassigned from one initialized partition to another.
Reassignment of resources can only be performed by the
initialized partition to which the resource is currently
assigned. When a partition is in an uninitialized state, other
partitions may reassign its hardware components and may delete
it.

An uninitialized partition is a partition which has no primary
CPU executing either under control of a console program or an
operating system. For example, a partition may be uninitialized
due to a lack of sufficient resources at power up to run a
primary CPU, or when a system administrator is reconfiguring the
computer system. When in an uninitialized state, a partition may
reassign its hardware components and may be deleted by another
partition. Unassigned resources may be assigned by any
partition.

Partitions may be organized into "communities" which provide the
basis for grouping separate execution contexts to allow
cooperative resource sharing. Partitions in the same community
can share resources. Partitions that are not within the same
community cannot share resources. Resources may only be manually
moved between partitions that are not in the same community by
the system administrator by de-assigning the resource (and
stopping usage), and manually reconfiguring the resource.
Communities can be used to create independent operating system
domains, or to implement user policy for hardware usage. In
FIG. 2, partitions 202 and 204 have been organized into
community 230. Partition 206 may be in its own community 205.
Communities can be constructed using the configuration tree
described below and may be enforced by hardware.

The Console Program

When a computer system constructed in accordance with the
principles of the present invention is enabled on a platform,
multiple HWRPBs must be created, multiple console program copies
must be loaded, and system resources must be assigned in such a
way that each HWRPB is associated with specific components of
the system. To do this, the first console program to run will
create a configuration tree structure in memory which represents
all of the hardware in the system. The tree will also contain
the software partitioning information, and the assignments of
hardware to partitions and is discussed in detail below.

More specifically, when the APMP system is powered up, a CPU
will be selected as a primary CPU in a conventional manner by
hardware which is specific to the platform on which the system
is running. The primary CPU then loads a copy of a console
program into memory. This console copy is called a "master
console" program. The primary CPU initially operates under
control of the master console program to perform testing and
checking assuming that there is a single system which owns the
entire machine. Subsequently, a set of environment variables is
loaded which define the system partitions. Finally, the master
console creates and initializes the partitions based on the
environment variables. In this latter process the master console
operates to create the configuration tree, to create additional
HWRPB data blocks, to load the additional console program
copies, and to start the CPUs on the alternate HWRPBs. Each
partition then has an operating system instance running on it,
which instance cooperates with a console program copy also
running in that partition. In an unconfigured APMP system, the
master console program will initially create a single partition
containing the primary
CPU, a minimum amount of memory, and a physical system
administrator's console selected in a platform-specific way.
Console program commands will then allow the system
administrator to create additional partitions, and configure I/O
buses, memory, and CPUs for each partition.

After associations of resources to partitions have been made by
the console program, the associations are stored in non-volatile
RAM to allow for an automatic configuration of the system during
subsequent boots. During subsequent boots, the master console
program must validate the current configuration with the stored
configuration to handle the removal and addition of new
components. Newly-added components are placed into an unassigned
state, until they are assigned by the system administrator. If
the removal of a hardware component results in a partition with
insufficient resources to run an operating system, resources
will continue to be assigned to the partition, but it will be
incapable of running an operating system instance until
additional new resources are allocated to it.

As previously mentioned, the console program communicates with
an operating system instance by means of an HWRPB which is
passed to the operating system during operating system boot up.
The fundamental requirements for a console program are that it
should be able to create multiple copies of HWRPBs and itself.
Each HWRPB copy

The master console may generate a single copy of the tree which
copy is shared by all operating system instances, or it may
replicate the tree for each instance. A single copy of the tree
has the disadvantage that it can create a single point of
failure in systems with independent memories. However, platforms
that generate multiple tree copies require the console programs
to be capable of keeping changes to the tree synchronized.

The configuration tree comprises multiple nodes including root
nodes, child nodes and sibling nodes. Each node is formed of a
fixed header and a variable length extension for overlaid data
structures. The tree starts with a tree root node 302
representing the entire system box, followed by branches that
describe the hardware configuration (hardware root node 304),
the software configuration (software root node 306), and the
minimum partition requirements (template root node 308). In
FIG. 3, the arrows represent child and sibling relationships.
The children of a node represent component parts of the hardware
or software configuration. Siblings represent peers of a
component that may not be related except by having the same
parent. Nodes in the tree 300 contain information on the
software communities and operating system instances, hardware
configuration, configuration constraints, performance boundaries
and hot-swap capabilities. The nodes also pro-
created by the console program will be capable of booting an 25 vide the relationship of hardware to software ownership, or 
independent operating system instance into a private section the sharing of a hardware component, 
of memory and each operating system instance booted in The nodes are stored contiguously in memory and the 
this manner can be identified by a unique value placed into address oflket from the tree root node 302 of the tree 300 to 
the HWRPB. The value indicates the partition, and is also a specific node forms a "handle" which may be used from 
used as the operating system instance ID. 30 any operating system instance to unambiguously identify the 

In addition, the console program is configured to provide same component on any operating system instance. In 

a mechanism to remove a CPU from the available CPUs addition, each component in the inventive computer system 

within a partition in response to a request by an operating has a separate ID. This may illustratively be a 64-bit 

system running in that partition. Each operating system unsigned value. The ID must specify a unique component 

instance must be able to shutdown, halt, or otherwise crash 35 when combined with the type and subtype values of the 

in a manner that control is passed to the console program. component. That is, for a given type of component, the ID 

Conversely, each operating system instance must be able to must identify a specific component. The ID may be a simple 

reboot into an operational mode, independently of any other number, for example the CPU ID, it may be some other 

operating system instance. unique encoding, or a physical address. The component ID 

Each HWRPB which is created by a console program will 40 and handle allow any member of the computer system to 
contain a CPU slot-specific database for each CPU that is in identify a specific piece of hardware or software. That is, any 
the system, or that can be added to the system without partition using either method of specification must be able to 
powering the entire system down. Each CPU that is physi- use the same specification, and obtain the same result, 
cally present will be marked "present", but only CPUs that As described above, the inventive computer system is 
will initially execute in a specific partition will be marked 45 composed of one or more communities which, in turn, are 
"available" in the HWRPB for the partition. The operating composed of one or more partitions. By dividing the parti- 
system instance running on a partition will be capable of tions across the independent communities, the inventive 
recognizing that a CPU may be available at some future time computer system can be placed into a configuration in which 
by a present (PP) bit in a per-CPU state flag fields of the sharing of devices and memory can be limited. Communities 
HWRPB, and can build data structures to reflect this. When 50 and partitions will have IDs which are densely packed. The 
set, the available (PA) bit in the per-CPU state flag fields hardware platform will determine the maximum number of 
indicates that the associated CPU is currently associated partitions based on the hardware that is present in the 
with the partition, and can be invited to join SMP operation. system, as well as having a platform maximum limit. 
The Configuration Tree Partition and community IDs will never exceed this value 

As previously mentioned, the master console program 55 during mntime. IDs will be reused for deleted partitions and 
creates a configuration tree which represents the hardware communities. The maximum number of communities is the 
configuration, and the assignment of each component in the same as the maximum number of partitions. In addition, 
system to each partition. Each console program then iden- each operating system instance is identified by a unique 
tifies the configuration tree to its associated operating system instance identifier, for example a combination of the parti- 
instance by placing a pointer to the tree in the HWRPB. 60 tion ID plus an incarnation number. 

Referring to FIG. 3, the configuration tree 300 represents The communities and partitions are represented by a 

the hardware components in the system, the platform con- software root node 306, which has community node children 

straints and minimums, and the software configuration. The (of which community node 310 is shown), and partition 

master console program builds the tree using information node grandchildren (of which two nodes, 312 and 314, are 

discovered by probing the hardware, and from information 65 shown.) 

stored in non-volatile RAM which contains configuration The hardware components are represented by a hardware 

information generated during previous initializations. root node 304 which contains children that represent a 
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hierarchical representation of all of the hardware currently the buses. However, the tree may also end at any level, if all 
present in the computer system. "Ownership" of a hardware components below it cannot be configured independently, 
component is represented by a handle in the associated System software will still be required to probe for bus and 
hardware node which points to the appropriate software device information not provided by the tree, 
node (310, 312 or 314.) These handles are illustrated in FIG. 5 The console program implements and enforces configu- 
4 which will be discussed in more detail below. Components ration constraints, if any, on each component of the system, 
that are owned by a specific partition will have handles that In general, components are either assignable without con- 
point to the node representing the partition. Hardware which straints (for example, CPUs may have no constraints), or are 
is shared by multiple partitions (for example, memory) will configurable only as a part of another component (a device 
have handles that point to the community to which sharing lO adapter, for example, may be configurable only as a part of 
is confined. Un-owned hardware will have a handle of zero its bus). A partition which is, as explained above, a grouping 
(representing the tree root node 302). of CPUs, memory, and I/O devices into a unique software 
Hardware components place configuration constraints on entity also has minimum requirements. For example, the 
how ownership may be divided. A "config" handle in the minimum hardware requirements for a partition are at least 
configuration tree node associated with each component 15 one CPU, some private memory (platform dependent 
determines if the component is free to be associated any- minimum, including console memory) and an 1/0 bus, 
where in the computer system by pointing to the hardware including a physical, non-shared, console port, 
root node 304. However, some hardware components may The minimal component requirements for a partition are 
be bound to an ancestor node and must be configured as part provided by the information contained in the template root 
of this node. Examples of this are CPUs, which may have no 20 node 308. The template root node 308 contains nodes, 316, 
constraints on where they execute, but which are a compo- 318 and 320, representing the hardware components that 
nent part of a system building block (SBB), such as SBBs must be a provided to create a partition capable of execution 
322 or 324. In this case, even though the CPU is a child of of a console program and an operating system instance, 
the SBB, its config handle will point to the hardware root Configuration editors can use this information as the basis to 
node 304. An I/O bus, however, may not be able to be owned 25 determine what types, and how many resources must be 
by a partition other than the partition that owns its I/O available to form a new partition. 

processor. In this case, the configuration tree node repre- During the construction of a new partition, the template 

senting the I/O bus would have a config handle pointing to subtree will be "walked", and, for each node in the template 

the I/O processor. Because the rules governing hardware subtree, there must be a node with the same type and subtype 

configuration are platform specific, this information is pro- 30 owned by the new partition so that it will be capable of 

vided to the operating system instances by the config handle. loading a console program and booting an operating system 

Each hardware component also has an "aflSnity" handle. instance. If there are more than one node of the same type 

The affinity handle is identical to the config handle, except and subtype in the template tree, there must also be multiple 

that it represents a configuration which will obtain the best nodes in the new partition. The console program will use the 

performance of the component. For example, a CPU or 35 template to validate that a new partition has the minimum 

memory may have a config handle which allows it to be requirements prior to attempting to load a console program 

configured anywhere in the computer system (it points to the and initialize operation. 

hardware root node 304), however, for optimal performance. The following is a detailed example of a particular 
the CPU or memory should be configured to use the System implementation of configuration tree nodes. It is intended 
BuildingBlockofwhichthey are apart. The result is that the 40 for descriptive purposes only and is not intended to be 
config pointer points to the hardware root node 304, but the limiting. Each HWRPB must point to a configuration tree 
affinity pointer points to an SBB node such as node 322 or which provides the current configuration, and the assign- 
node 324. The affinity of any component is platform specific, ments of components to partitions. A configuration pointer 
and determined by the firmware. Firmware may use this (in the CONFIG field) in the HWRPB is used to point to the 
information when asked to form "optimal" automatic con- 45 configuration tree. The CONFIG field points to a 64-byte 
figurations. header containing the size of the memory pool for the tree. 

Each node also contains several flags which indicate the and the initial checksum of the memory. Immediately fol- 

type and state of the node. These flags include a node_ lowing the header is the root node of the tree. The header and 

hotswap flag which indicates that the component represented root node of the tree will be page aligned, 

is a "hot swappable'' component and can be powered down 50 The total size in bytes of the memory allocated for the 

independently of its parent and siblings. However, all chil- configuration tree is located in the first quadword of the 

dren of this node must power down if this component header. The size is guaranteed to be in multiples of the 

powers down. If the children can power down independently hardware page size. The second quadword of the header is 

of this component, they must also have this bit set in their reserved for a checksum. In order to examine the configu- 

corresponding nodes. Another flag is a node_unavailable 5S ration tree, an operating system instance maps the tree into 

flag which, when set, indicates that the component repre- its local address space. Because an operating system 

sented by the node is not currently available for use. When instance may map this memory with read access allowed for 

a component is powered down (or is never powered up) it is all applications, some provision must be made to prevent a 

flagged as unavailable. non-privileged application from gaining access to console 

TWo flags, node_hardware and nod6__template, indicate 60 data to which it should not have access. Access may be 

the type of node. Further flags, such as node_initialized and restricted by appropriately allocating memory. For example, 

node_cpu_primary may also be provided to indicate the memory may be page aligned and allocated in whole 

whether the node represents a partition which has been pages. Normally, an operating system instance will map the 

initialized or a CPU that is currently a primary CPU. first page of the configuration tree, obtain the tree size, and 

The configuration tree 300 may extend to the level of 65 then remap the memory allocated for configuration tree 

device controllers, which wfll allow the operating system to usage. The total size may include additional memory used 

build bus and device configuration tables without probing by the console for dynamic changes to the tree. 
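The header layout described above (size in the first quadword, checksum in the second, root node immediately following) can be sketched in C. This is only an illustration: the type name gct_header_sketch, the helper names, and the treatment of the remaining 48 bytes as opaque are assumptions, since the text describes only the first two quadwords.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the 64-byte header that the CONFIG field points to.
 * Only the first two quadwords are described in the text; the remaining
 * bytes are treated here as opaque (an assumption). */
typedef struct {
    uint64_t tree_size;     /* total bytes allocated for the tree */
    uint64_t checksum;      /* second quadword, reserved for a checksum */
    uint8_t  reserved[48];  /* rest of the 64-byte header */
} gct_header_sketch;

_Static_assert(sizeof(gct_header_sketch) == 64, "header must be 64 bytes");

/* Return the tree size if it is a non-zero multiple of the hardware page
 * size, as the text guarantees; return 0 for an implausible header. */
static uint64_t gct_tree_size(const gct_header_sketch *hdr, uint64_t page_size)
{
    assert(hdr != NULL && page_size != 0);
    if (hdr->tree_size == 0 || hdr->tree_size % page_size != 0)
        return 0;
    return hdr->tree_size;
}

/* The root node immediately follows the header. */
static const void *gct_root_from_header(const gct_header_sketch *hdr)
{
    return (const uint8_t *)hdr + sizeof(gct_header_sketch);
}
```

An operating system instance following the two-step mapping in the text would map one page, call something like gct_tree_size on it, and then remap that many bytes before touching the root node.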
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Preferably, configuration tree nodes are formed with fixed 
headers, and may optionally contain type-specific information
following the fixed portion. The size field contains the
full length of the node; nodes are illustratively allocated in
multiples of 64 bytes and padded as needed. The following
description defines illustrative fields in the fixed header for
a node:



typedef struct _gct_node {
    unsigned char   type;
    unsigned char   subtype;
    uint16          size;
    GCT_HANDLE      owner;
    GCT_HANDLE      current_owner;
    GCT_ID          id;
    union {
        uint64      node_flags;
        struct {
            unsigned    node_hardware     : 1;
            unsigned    node_hotswap      : 1;
            unsigned    node_unavailable  : 1;
            unsigned    node_hw_template  : 1;
            unsigned    node_initialized  : 1;
            unsigned    node_cpu_primary  : 1;
#define NODE_HARDWARE       0x001
#define NODE_HOTSWAP        0x002
#define NODE_UNAVAILABLE    0x004
#define NODE_HW_TEMPLATE    0x008
#define NODE_INITIALIZED    0x010
#define NODE_PRIMARY        0x020
        } flag_bits;
    } flag_union;
    GCT_HANDLE      config;
    GCT_HANDLE      affinity;
    GCT_HANDLE      parent;
    GCT_HANDLE      next_sib;
    GCT_HANDLE      prev_sib;
    GCT_HANDLE      child;
    GCT_HANDLE      reserved;
    uint32          magic;
} GCT_NODE;

In the above definition the type definitions "uint" are
unsigned integers with the appropriate bit lengths. As previously
mentioned, nodes are located and identified by a
handle (identified by the typedef GCT_HANDLE in the
definition above). An illustrative handle is a signed 32-bit
offset from the base of the configuration tree to the node. The
value is unique across all partitions in the computer system.
That is, a handle obtained on one partition must be valid to
look up a node, or as an input to a console callback, on all
partitions. The magic field contains a predetermined bit
pattern which indicates that the node is actually a valid node.
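As a rough illustration of how a handle might be checked before use, the sketch below treats a handle as a signed 32-bit byte offset from the tree base and tests the magic field at that offset. The magic value GCT_MAGIC_SKETCH, the reduced node layout, the function name, and the 64-byte alignment test (derived from the 64-byte allocation granularity, assuming every node starts on such a boundary) are assumptions for illustration only; the upper bound stands in for a limit such as the root node's high_limit field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t gct_handle;           /* signed 32-bit offset from the tree base */

#define GCT_MAGIC_SKETCH 0x47435421u  /* hypothetical bit pattern */

/* Reduced node header holding only the fields this sketch needs;
 * the real fixed header has more fields in a different layout. */
typedef struct {
    uint16_t size;
    uint32_t magic;
} node_sketch;

/* Return 1 if handle is within [0, high_limit], falls on a 64-byte
 * boundary, and lands on a node carrying the expected magic pattern. */
static int gct_handle_valid(const uint8_t *tree_base,
                            gct_handle high_limit,
                            gct_handle handle)
{
    uint32_t magic;

    assert(tree_base != NULL);
    if (handle < 0 || handle > high_limit || handle % 64 != 0)
        return 0;
    /* memcpy avoids aliasing problems when the tree is a raw byte mapping. */
    memcpy(&magic, tree_base + handle + offsetof(node_sketch, magic),
           sizeof magic);
    return magic == GCT_MAGIC_SKETCH;
}
```

Because the handle is an offset rather than a pointer, the same check gives the same answer in every partition regardless of where each one maps the tree.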

The tree root node represents the entire system. Its handle
is always zero. That is, it is always located at the first
physical location in the memory allocated for the configuration
tree following the config header. It has the following
definition:



typedef struct _gct_root_node {
    GCT_NODE        hd;
    uint64          lock;
    uint64          transient_level;
    uint64          current_level;
    uint64          console_req;
    uint64          min_alloc;
    uint64          min_align;
    uint64          base_alloc;
    uint64          base_align;
    uint64          max_phys_address;
    uint64          mem_size;
    uint64          platform_type;
    int32           platform_name;
    GCT_HANDLE      primary_instance;
    GCT_HANDLE      first_free;
    GCT_HANDLE      high_limit;
    GCT_HANDLE      lookaside;
    GCT_HANDLE      available;
    uint32          max_partition;
    int32           partitions;
    int32           communities;
    uint32          max_platform_partition;
    uint32          max_fragments;
    uint32          max_desc;
    char            APMP_id[16];
    char            APMP_id_pad[4];
    int32           bindings;
} GCT_ROOT_NODE;



The fields in the root node are defined as follows:
lock
    This field is used as a simple lock by software wishing to
    inhibit changes to the structure of the tree, and the
    software configuration. When this value is -1 (all bits
    on) the tree is unlocked; when the value is >=0 the tree
    is locked. This field is modified using atomic operations.
    The caller of the lock routine passes a partition ID
    which is written to the lock field. This can be used to
    assist in fault tracing, and recovery during crashes.
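The lock convention just described (-1 means unlocked, a non-negative partition ID means locked by that partition) maps naturally onto a compare-and-swap. The sketch below uses C11 atomics and reads the 64-bit field as a signed value so that "all bits on" compares equal to -1; the function names are hypothetical, and the actual console lock routine is not given in the text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define GCT_UNLOCKED ((int64_t)-1)   /* all bits on: tree is unlocked */

/* Try once to take the tree lock on behalf of partition_id.
 * Succeeds only if the field currently holds -1. */
static bool gct_try_lock(_Atomic int64_t *lock, int64_t partition_id)
{
    int64_t expected = GCT_UNLOCKED;

    assert(partition_id >= 0);
    return atomic_compare_exchange_strong(lock, &expected, partition_id);
}

/* Release the lock, but only if partition_id is the recorded holder. */
static bool gct_unlock(_Atomic int64_t *lock, int64_t partition_id)
{
    int64_t expected = partition_id;

    return atomic_compare_exchange_strong(lock, &expected, GCT_UNLOCKED);
}
```

Writing the holder's partition ID rather than a plain flag is what makes the fault tracing mentioned above possible: after a crash, the field shows which partition held the lock.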
transient_level
    This field is incremented at the start of a tree update.
current_level
    This field is updated at the completion of a tree update.
console_req
    This field specifies the memory required in bytes for the
    console in the base memory segment of a partition.
min_alloc
    This field holds the minimum size of a memory fragment,
    and the allocation unit (fragment size must be a
    multiple of the allocation). It must be a power of 2.
min_align
    This field holds the alignment requirements for a memory
    fragment. It must be a power of 2.
base_alloc
    This field specifies the minimum memory in bytes
    (including console_req) needed for the base memory
    segment for a partition. This is where the console,
    console structures, and operating system will be loaded
    for a partition. It must be greater or equal to min_alloc
    and a multiple of min_alloc.
base_align
    This field holds the alignment requirement for the base
    memory segment of a partition. It must be a power of
    2, and have an alignment of at least min_align.
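The constraints on these five fields can be collected into a single consistency check. The following is a sketch under stated assumptions: the struct alloc_params_sketch and both function names are invented for illustration, and the test that base_alloc is at least console_req is an inference from the phrase "including console_req" rather than a rule stated outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the allocation fields of the root node. */
typedef struct {
    uint64_t console_req;
    uint64_t min_alloc;
    uint64_t min_align;
    uint64_t base_alloc;
    uint64_t base_align;
} alloc_params_sketch;

static bool is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Check the relationships the field descriptions state. */
static bool alloc_params_valid(const alloc_params_sketch *p)
{
    return is_pow2(p->min_alloc)
        && is_pow2(p->min_align)
        && p->base_alloc >= p->min_alloc
        && p->base_alloc % p->min_alloc == 0
        && p->base_alloc >= p->console_req   /* inferred, see lead-in */
        && is_pow2(p->base_align)
        && p->base_align >= p->min_align;
}

/* Check that a proposed memory fragment respects min_alloc and min_align. */
static bool fragment_ok(const alloc_params_sketch *p, uint64_t pa, uint64_t size)
{
    assert(alloc_params_valid(p));
    return size != 0 && size % p->min_alloc == 0 && pa % p->min_align == 0;
}
```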
max_phys_address
    This field holds the calculated largest physical address that
    could exist on the system, including memory subsystems
    that are not currently powered on and available.
mem_size
    This field holds the total memory currently in system.
platform_type
    This field stores the type of platform taken from a field in
    the HWRPB.
platform_name
    This field holds an integer offset from the base of the tree
    root node to a string representing the name of the
    platform.
primary_instance
    This field stores the partition ID of the first operating
    system instance.
first_free
    This field holds the offset from the tree root node to the
    first free byte of memory pool used for new nodes.
high_limit
    This field holds the highest address at which a valid node
    can be located within the configuration tree. It is used
    by callbacks to validate that a handle is legal.
lookaside
    This field is the handle of a linked list of nodes that have
    been deleted, and that may be reclaimed. When a
    community or partition are deleted, the node is linked
    into this list and creation of a new partition or
    community will look at this list before allocating from free
    pool.
available
    This field holds the number of bytes remaining in the free
    pool pointed to by the first_free field.
max_partitions
    This field holds the maximum number of partitions
    computed by the platform based on the amount of hardware
    resources currently available.
partitions
    This field holds an offset from the base of the root node
    to an array of handles. Each partition ID is used as an
    index into this array, and the partition node handle is
    stored at the indexed location. When a new partition is
    created, this array is examined to find the first partition
    ID which does not have a corresponding partition node
    handle and this partition ID is used as the ID for the
    new partition.
communities
    This field also holds an offset from the base of the root
    node to an array of handles. Each community ID is used
    as an index into this array, and a community node handle
    is stored in the array. When a new community is
    created, this array is examined to find the first
    community ID which does not have a corresponding
    community node handle and this community ID is used as the
    ID for the new community. There cannot be more
    communities than partitions, so the array is sized based
    on the maximum number of partitions.
max_platform_partition
    This field holds the maximum number of partitions that
    can simultaneously exist on the platform, even if
    additional hardware is added (potentially inswapped).
max_fragments
    This field holds a platform defined maximum number of
    fragments into which a memory descriptor can be
    divided. It is used to size the array of fragments in the
    memory descriptor node.
max_desc
    This field holds the maximum number of memory
    descriptors for the platform.
APMP_id
    This field holds a system ID set by system software and
    saved in non-volatile RAM.
APMP_id_pad
    This field holds padding bytes for the APMP ID.
bindings
    This field holds an offset to an array of "bindings." Each
    binding entry describes a type of hardware node, the
    type of node the parent must be, the configuration
    binding, and the affinity binding for a node type.
    Bindings are used by software to determine how node
    types are related and configuration and affinity rules.

A community provides the basis for the sharing of resources between
partitions. While a hardware component may be assigned to any partition in a
community, the actual sharing of a device, such as memory, occurs only within
a community. [illegible] to a control section called an APMP database, which
allows operating system instances to [illegible] membership in the community
for the purpose of sharing memory and communications between instances. The
APMP database and [illegible] communities are discussed in detail
[illegible]. The configuration ID for the community is a signed 16-bit
integer value assigned by the console. The ID will never be greater than the
maximum number of partitions that can be created on the platform.

A partition node, such as node 312 or 314, represents a collection of
hardware that is capable of running an independent copy of the console
program, and an independent copy of an operating system. The configuration ID
for this node is a signed 16-bit integer value assigned by the console. The
ID will never be greater than the maximum number of partitions that can be
created on the platform. The node has the definition:

typedef struct _gct_partition_node {
    GCT_NODE        hd;
    uint64          hwrpb;
    uint64          incarnation;
    [illegible]     priority;
    [illegible]     os_type;
    [illegible]     partition_reserved_1;
    uint64          instance_name_format;
    char            instance_name[128];
} GCT_PARTITION_NODE;

The defined fields have the definitions:
hwrpb
    This field holds the physical address of the hardware
    restart parameter block for this partition. To minimize
    changes to the HWRPB, the HWRPB does not contain
    a pointer to the partition, or the partition ID. Instead,
    the partition nodes contain a pointer to the HWRPB.
    System software can then determine the partition ID of
    the partition in which it is running by searching the
    partition nodes for the partition which contains the
    physical address of its HWRPB.
incarnation
    This field holds a value which is incremented each time
    the primary CPU of the partition executes a boot or
    restart operation on the partition.
priority
    This field holds a partition priority.
os_type
    This field holds a value which indicates the type of
    operating system that will be loaded in the partition.
partition_reserved_1
    This field is reserved for future use.
instance_name_format
    This field holds a value that describes the format of the
    instance name string.
instance_name
    This field holds a formatted string which is interpreted
    using the instance_name_format field. The value in
    this field provides a high-level path name to the
    operating system instance executing in the partition. This
    field is loaded by system software and is not saved
    across power cycles. The field is cleared at power up
    and at partition creation and deletion.
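Two lookups follow from the partitions array and the hwrpb field: finding the first unused partition ID, and finding one's own partition ID from the physical address of one's HWRPB. The sketch below models the array of handles as an array of pointers in which NULL marks an unused ID; that modelling, the reduced struct partition_sketch, and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced partition node: only the field this sketch needs. */
typedef struct {
    uint64_t hwrpb;   /* physical address of the partition's HWRPB */
} partition_sketch;

/* Search the ID-indexed array for the partition whose node records the
 * caller's HWRPB physical address. Returns the partition ID, or -1. */
static int find_partition_id(const partition_sketch *const slots[],
                             size_t max_partitions, uint64_t my_hwrpb_pa)
{
    assert(slots != NULL);
    for (size_t id = 0; id < max_partitions; id++)
        if (slots[id] != NULL && slots[id]->hwrpb == my_hwrpb_pa)
            return (int)id;
    return -1;
}

/* Return the first ID with no partition node, which becomes the ID of a
 * newly created partition. Returns -1 when every ID is in use. */
static int first_free_partition_id(const partition_sketch *const slots[],
                                   size_t max_partitions)
{
    assert(slots != NULL);
    for (size_t id = 0; id < max_partitions; id++)
        if (slots[id] == NULL)
            return (int)id;
    return -1;
}
```

Taking the lowest free index is what keeps IDs densely packed and lets the IDs of deleted partitions be reused.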
A System Building Block node, such as node 322 or 324,
represents an arbitrary piece of hardware, or conceptual
grouping used by system platforms with modular designs
such as that illustrated in FIG. 2. A QBB (Quad Building
Block) is a specific example of an SBB and corresponds to
units such as units 100, 102, 104 and 106 in FIG. 1. Children
of the SBB nodes 322 and 324 include input/output processor
nodes 326 and 340.

CPU nodes, such as nodes 328-332 and 342-346, are
assumed to be capable of operation as a primary CPU for
SMP operation. In the rare case where a CPU is not primary
capable, it will have a SUBTYPE code indicating that it
cannot be used as a primary CPU in SMP operation. This
information is critical when configuring resources to create
a new partition. The CPU node will also carry information
on where the CPU is currently executing. The primary for a
partition will have the NODE_CPU_PRIMARY flag set in
the NODE_FLAGS field. The CPU node has the following
definition:



typedef struct _gct_cpu_node {
    GCT_NODE        hd;
} GCT_CPU_NODE;



A memory subsystem node, such as node 334 or 348, is 
a "pseudo" node that groups together nodes representing the 
physical memory controllers and the assignments of the 
memory that the controllers provide. The children of this 
node consist of one or more memory controller nodes (such 
as nodes 336 and 350) which the console has configured to 
operate together (interleaved), and one or more memory 
descriptor nodes (such as nodes 338 and 352) which 
describe physically contiguous ranges of memory. 

A memory controller node (such as nodes 336 or 350) is 
used to express a physical hardware component, and its 
owner is typically the partition which will handle errors, and 
initialization. Memory controllers cannot be assigned to 
communities, as they require a specific operating system 
instance for initialization, testing and errors. However, a 
memory description, defined by a memory descriptor node, 
may be split into "fragments" to allow different partitions or
communities to own specific memory ranges within the 
memory descriptor. Memory is unlike other hardware 
resources in that it may be shared concurrently, or broken 
into "private" areas. Each memory descriptor node contains 
a list of subset ranges that allow the memory to be divided 
among partitions, as well as shared between partitions 
(owned by a community). A memory descriptor node (such 
as nodes 338 or 352) is defined as: 



typedef struct _gct_mem_desc_node {
    GCT_NODE        hd;
    GCT_MEM_INFO    mem_info;
    int32           mem_frag;
} GCT_MEM_DESC_NODE;




The mem_info structure has the following definition:

typedef struct _gct_mem_info {
    uint64          base_pa;
    uint64          base_size;
    uint32          desc_count;
    uint32          info_fill;
} GCT_MEM_INFO;


The mem_frag field holds an offset from the base of the
memory descriptor node to an array of GCT_MEM_DESC
structures which have the definition:


typedef struct _gct_mem_desc {
    uint64          pa;
    uint64          size;
    GCT_HANDLE      mem_owner;
    GCT_HANDLE      mem_current_owner;
    union {
        uint32      mem_flags;
        struct {
            unsigned    mem_console   : 1;
            unsigned    mem_private   : 1;
            unsigned    mem_shared    : 1;
            unsigned    base          : 1;
#define CGT_MEM_CONSOLE     0x1
#define CGT_MEM_PRIVATE     0x2
#define CGT_MEM_SHARED      0x4
#define CGT_MEM_BASE        0x8
        } flag_bits;
    } flag_union;
    uint32          mem_fill;
} GCT_MEM_DESC;



The number of fragments in a memory description node
(nodes 338 or 352) is limited by platform firmware. This
creates an upper bound on memory division, and limits
unbounded growth of the configuration tree. Software can
determine the maximum number of fragments from the
max_fragments field in the tree root node 302 (discussed
above), or by calling an appropriate console callback function
to return the value. Each fragment can be assigned to
any partition, provided that the config binding, and the
ownership of the memory descriptor and memory subsystem
nodes allow it. Each fragment contains a base physical
address, size, and owner field, as well as flags indicating the
type of usage.

To allow shared memory access, the memory subsystem
parent node, and the memory descriptor node must be owned
by a community. The fragments within the memory descriptor
may then be owned by the community (shared) or by any
partition within the community.

Fragments can have minimum allocation sizes and alignments
provided in the tree root node 302. The base memory
for a partition (the fragments where the console and operating
system will be loaded) may have a greater allocation
and alignment than other fragments (see the tree root node
definition above). If the owner field of the memory descriptor
node is a partition, then the fragments can only be owned
by that partition.

FIG. 4 illustrates the configuration tree shown in FIG. 3
when it is viewed from a perspective of ownership. The
console program for a partition relinquishes ownership and
control of the partition resources to the operating system
instance running in that partition when the primary CPU for
that partition starts execution. The concept of "ownership"
determines how the hardware resources and CPUs are
assigned to software partitions and communities. The
configuration tree has ownership pointers illustrated in FIG. 4
which determine the mapping of hardware devices to software
such as partitions (exclusive access) and communities
(shared access). An operating system instance uses the
information in the configuration tree to determine to which
hardware resources it has access and reconfiguration control.

Passive hardware resources which have no owner are
unavailable for use until ownership is established. Once
ownership is established by altering the configuration tree,
the operating system instances may begin using the
resources. When an instance makes an initial request,
ownership can be changed by causing the owning operating
system to stop using a resource or by a console program
taking action to stop using a resource in a partition where no
operating system instance is executing. The configuration
tree is then altered to transfer ownership of the resource to
another operating system instance. The action required to
cause an operating system to stop using a hardware resource
is operating system specific, and may require a reboot of the
operating system instances affected by the change.

To manage the transition of a resource from an owned and
active state to an unowned and inactive state, two fields are
provided in each node of the tree. The owner field represents
the owner of a resource and is loaded with the handle of the
owning software partition or community. At power up of an
APMP system, the owner fields of the hardware nodes are
loaded from the contents of non-volatile RAM to establish
an initial configuration.

To change the owner of a resource, the handle value is
modified in the owner field of the hardware component, and
in the owner fields of any descendants of the hardware
component which are bound to the component by their
config handles. The current_owner field represents the
current user of the resource. When the owner and
current_owner fields hold the same non-zero value, the resource
is owned and active. Only the owner of a resource can
de-assign the resource (set the owner field to zero). A
resource that has null owner and current_owner fields is
unowned, and inactive. Only resources which have null
owner and current_owner fields may be assigned to a new
partition or community.

When a resource is de-assigned, the owner may decide to
de-assign the owner field, or both the owner and
current_owner fields. The decision is based on the ability of the
owning operating system instance running in the partition to
discontinue the use of the resource prior to de-assigning
ownership. In the case where a reboot is required to
relinquish ownership, the owner field is cleared, but the
current_owner field is not changed. When the owning operating
system instance reboots, the console program can clear any
current_owner fields for resources that have no owner
during initialization.

During initialization, the console program will modify the
current_owner field to match the owner field for any node
of which it is the owner, and for which the current_owner
field is null. System software should only use hardware of
which it is the current owner. In the case of a de-assignment
of a resource which is owned by a community, it is the
responsibility of system software to manage the transition
between states. In some embodiments, a resource may be
loaned to another partition. In this condition, the owner and
current_owner fields are both valid, but not equal. The
following table summarizes the possible resource states and
the values of the owner and current_owner fields:



TABLE 1

owner field value    current_owner field value    Resource State

none                 none                         unowned, and inactive
none                 valid                        unowned, but still active
valid                none                         owned, not yet active
valid                equal to owner               owned and active
valid                is not equal to owner        loaned
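
By way of illustration only, the five states of Table 1 can be derived from the two fields with a small function. The following Python sketch is not part of the described embodiment. The names resource_state and can_assign are hypothetical, and zero is assumed to stand for a null handle, consistent with de-assignment setting the owner field to zero.

```python
def resource_state(owner: int, current_owner: int) -> str:
    """Classify a passive resource per Table 1 (0 is assumed to mean 'none')."""
    if owner == 0 and current_owner == 0:
        return "unowned, and inactive"
    if owner == 0:
        return "unowned, but still active"
    if current_owner == 0:
        return "owned, not yet active"
    if owner == current_owner:
        return "owned and active"
    return "loaned"


def can_assign(owner: int, current_owner: int) -> bool:
    """Only resources with null owner and current_owner may be assigned anew."""
    return resource_state(owner, current_owner) == "unowned, and inactive"
```

For example, a resource whose owner field has been cleared pending a reboot, but whose current_owner field still holds a handle, is classified as "unowned, but still active" and cannot yet be assigned to a new partition.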



Because CPUs are active devices, and sharing of CPUs 
means that a CPU could be executing in the context of a 
partition which may not be its "owner", ownership of a CPU 
is different from ownership of a passive resource. The CPU 
node in the configuration tree provides two fields that 
indicate which partition a CPU is nominally "owned" by, 
and in which partition the CPU is currently executing. The 
owner field contains a value which indicates the nominal 
ownership of the CPU, or more specifically, the partition in 
which the CPU will initially execute at system power up.

Until an initial ownership is established (that is, if the 
owner field is unassigned), CPUs are placed into a HWRPB 
context decided by the master console, but the HWRPB 
available bit for the CPU will not be set in any HWRPB. 
This combination prevents the CPU from joining any oper- 
ating system instance in SMP operation. When ownership of 
a CPU is established (the owner field is filled in with a valid 
partition handle), the CPU will migrate, if necessary, to the 
owning partition, set the available bit in the HWRPB asso- 
ciated with that partition, and request to join SMP operation 
of the instance running in that partition, or join the console 
program in SMP mode. The combination of the present and 
available bits in the HWRPB tells the operating system
instance that the CPU is available for use in SMP operation, 
and the operating system instance may use these bits to build 
appropriate per-CPU data structures, and to send a message 
to the CPU to request it to join SMP operation. 

When a CPU sets the available bit in an HWRPB, it also 
enters a value into the current_owner field in its correspond- 
ing CPU node in the configuration tree. The current_owner 
field value is the handle of the partition in which the CPU 
has set the active HWRPB bit and is capable of joining SMP 
operation. The current_owner field for a CPU is only set by 
the console program. When a CPU migrates from one 
partition to another partition, or is halted into an unassigned 
state, the current_owner field is cleared (or changed to the
new partition handle value) at the same time that the
available bit is cleared in the HWRPB. The current_owner
field should not be written to directly by system software, 
and only reflects which HWRPB has the available bit set for 
the CPU. 
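
The coupling between the available bit and the current_owner field can be sketched as follows. This Python model is an illustrative assumption rather than the console implementation: the classes CpuNode and Console are hypothetical, each HWRPB is reduced to a set of CPU identifiers whose available bit is set, and the atomicity of the update is not modeled.

```python
class CpuNode:
    """Hypothetical stand-in for a CPU node in the configuration tree."""

    def __init__(self, cpu_id: int, owner: int = 0):
        self.cpu_id = cpu_id
        self.owner = owner        # nominal owner (partition handle, 0 = none)
        self.current_owner = 0    # written only by the console


class Console:
    """Hypothetical console model: one 'available' set per partition HWRPB."""

    def __init__(self, partitions):
        self.available = {p: set() for p in partitions}

    def clear_available(self, cpu: CpuNode) -> None:
        # Clear the available bit and current_owner together.
        if cpu.current_owner:
            self.available[cpu.current_owner].discard(cpu.cpu_id)
        cpu.current_owner = 0

    def set_available(self, cpu: CpuNode, partition: int) -> None:
        # Leaving the old partition and entering the new one is one step here.
        self.clear_available(cpu)
        self.available[partition].add(cpu.cpu_id)
        cpu.current_owner = partition
```

In this model the current_owner field always names the one partition whose HWRPB has the available bit set for the CPU, and the nominal owner field is left untouched when the CPU executes elsewhere.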

During runtime, an operating system instance can tem- 
porarily "loan" a CPU to another partition without changing 
the nominal ownership of the CPU. The traditional SMP 
concept of ownership using the HWRPB present and avail- 
able bits is used to reflect the current execution context of 
the CPU by modifying the HWRPB and the configuration 
tree in atomic operations. The current_owner field can
further be used by system software in one of the partitions 
to determine in which partition the CPU is currently execut- 
ing (other instances can determine the location of a particu- 
lar CPU by examining the configuration tree.) 

It is also possible to de-assign a CPU and return it into a 
state in which the available bit is not set in any HWRPB, and 
the current_owner field in the configuration tree node for 
the CPU is cleared. This is accomplished by halting the 
execution of the CPU and causing the console program to 
clear the owner field in the configuration tree node, as well
as the current_owner field and the available HWRPB bit.
The CPU will then execute in console mode and poll the
owner field waiting for a valid partition handle to be written
to it. System software can then establish a new owner, and
the CPU begins execution in the new partition.

Illustrative ownership pointers are illustrated in FIG. 4 by
arrows. Each of the nodes in FIG. 4 that corresponds to a
similar node in FIG. 3 is given a corresponding number. For
example, the software root node denoted in FIG. 3 as node
306 is denoted as node 406 in FIG. 4. As shown in FIG. 4,
the community 410 is "owned" by the software root 406.
Likewise, the system building blocks 1 and 2 (422 and 425)
are owned by the community 410. Similarly, partitions 412
and 414 are also owned by the community 410.

Partition 412 owns CPUs 428-432 and the I/O processor
426. The memory controller 436 is also a part of partition 1
(412). In a like manner, partition 2 (414) owns CPUs
442-446, I/O processor 440 and memory controller 450.

The common or shared memory in the system is comprised
of memory subsystems 434 and 448 and memory
descriptors 438 and 452. These are owned by the community
410. Thus, FIG. 4 describes the layout of the system as it
would appear to the operating system instances.

Operating System Characteristics

As previously mentioned, the illustrative computer system
can operate with several different operating systems in
different partitions. However, conventional operating systems
may need to be modified in some aspects in order to
make them compatible with the inventive system, depending
on how the system is configured. Some sample modifications
for the illustrative embodiment are listed below:

1. Instances may need to be modified to include a mechanism
for choosing a "primary" CPU in the partition to run
the console and be a target for communication from other
instances. The selection of a primary CPU can be done in
a conventional manner using arbitration mechanisms or
other conventional devices.

2. Each instance may need modifications that allow it to
communicate and cooperate with the console program
which is responsible for creating a configuration data
block that describes the resources available to the partition
in which the instance is running. For example, the
instance should not probe the underlying hardware to
determine what resources are available for usage by the
instance. Instead, if it is passed a configuration data block
that describes what resources that instance is allowed to
access, it will need to work with the specified resources.

3. An instance may need to be capable of starting at an
arbitrary physical address and may not be able to reserve
any specific physical address in order to avoid conflicting
with other operating systems running at that particular
address.

4. An instance may need to be capable of supporting
multiple arbitrary physical holes in its address space, if it
is part of a system configuration in which memory is
shared between partitions. In addition, an instance may
need to deal with physical holes in its address space in
order to support "hot inswap" of memory.

5. An instance may need to pass messages and receive
notifications that new resources are available to partitions
and instances. More particularly, a protocol is needed to
inform an instance to search for a new resource.
Otherwise, the instance may never realize that the
resource has arrived and is ready for use.

6. An instance may need to be capable of running entirely
within its "private memory" if it is used in a system where
instances do not share memory. Alternatively, an instance
may need to be capable of using physical "shared
memory" for communicating or sharing data with other
instances running within the computer if the instance is
part of a system in which memory is shared. In such a
shared memory system, an instance may need to be
capable of mapping physical "shared memory" as identified
in the configuration tree into its virtual address
space, and the virtual address spaces of the "processes"
running within that operating system instance.

7. Each instance may need some mechanism to contact
another CPU in the computer system in order to communicate
with it.

8. An instance may also need to be able to recognize other
CPUs that are compatible with its operations, even if the
CPUs are not currently assigned to its partition. For
example, the instance may need to be able to ascertain
CPU parameters, such as console revision number and
clock speed, to determine whether it could run with that
CPU, if the CPU was re-assigned to the partition in which
the instance is running.

Changing the Configuration Tree

Each console program provides a number of callback
functions to allow the associated operating system instance
to change the configuration of the APMP system, for
example, by creating a new community or partition, or
altering the ownership of memory fragments. In addition,
other callback functions provide the ability to remove a
community, or partition, or to start operation on a
newly-created partition.

However, callback functions do not cause any changes to
take place on the running operating system instances. Any
changes made to the configuration tree must be acted upon
by each instance affected by the change. The type of action
that must take place in an instance when the configuration
tree is altered is a function of the type of change, and the
operating system instance capabilities. For example, moving
an input/output processor from one partition to another may
require both partitions to reboot. Changing the memory
allocation of fragments, on the other hand, might be handled
by an operating system instance without the need for a
reboot.

Configuration of an APMP system entails the creation of
communities and partitions, and the assignment of unassigned
components. When a component is moved from one
partition to another, the current owner removes itself as
owner of the resource and then indicates the new owner of
the resource. The new owner can then use the resource.
When an instance running in a partition releases a
component, the instance must no longer access the component.
This simple procedure eliminates the complex synchronization
needed to allow blind stealing of a component
from an instance, and possible race conditions in booting an
instance during a reconfiguration.

Once initialized, configuration tree nodes will never be
deleted or moved, that is, their handles will always be valid.
Thus, hardware node addresses may be cached by software.
Callback functions which purport to delete a partition or a
community do not actually delete the associated node, or
remove it from the tree, but instead flag the node as
UNAVAILABLE, and clear the ownership fields of any
hardware resource that was owned by the software component.

In order to synchronize changes to the configuration tree,
the root node of the tree maintains two counters
(transient_level and current_level). The transient_level counter is
incremented at the start of an update to the tree, and the
current_level counter is incremented when the update is
complete. Software may use these counters to determine
when a change has occurred, or is occurring to the tree. 
When an update is completed by a console, an interrupt can 
be generated to all CPUs in the APMP system. This interrupt 
can be used to cause system software to update its state 
based on changes to the tree. 
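
One way software might use the two counters is sketched below in Python. The text states only that the counters reveal when a change has occurred or is occurring; the retry-on-change read loop shown here is an assumed usage pattern, and the names TreeRoot, begin_update, end_update, update_in_progress and read_consistent are hypothetical.

```python
class TreeRoot:
    """Hypothetical root node holding the two synchronization counters."""

    def __init__(self):
        self.transient_level = 0
        self.current_level = 0


def begin_update(root: TreeRoot) -> None:
    root.transient_level += 1      # incremented at the start of an update


def end_update(root: TreeRoot) -> None:
    root.current_level += 1        # incremented when the update is complete


def update_in_progress(root: TreeRoot) -> bool:
    return root.transient_level != root.current_level


def read_consistent(root: TreeRoot, read_fn, max_tries: int = 100):
    """Assumed pattern: retry a read until no update overlapped it."""
    for _ in range(max_tries):
        start = root.transient_level
        if start != root.current_level:
            continue               # an update is in flight; try again
        value = read_fn()
        if root.transient_level == start:
            return value           # no update began while reading
    raise RuntimeError("configuration tree did not become stable")
```

A reader that caches current_level can also compare the cached value against the live counter to learn that the tree has changed since it last looked.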
Creation of an APMP Computer System 

FIG. 5 is a flowchart that illustrates an overview of the 
formation of the illustrative adaptively-partitioned, multi- 
processor (APMP) computer system. The routine starts in 
step 500 and proceeds to step 502 where a master console
program is started. If the APMP computer system is being 
created on power up, the CPU on which the master console 
runs is chosen by a predetermined mechanism, such as 
arbitration, or another hardware mechanism. If the APMP 
computer system is being created on hardware that is already 
running, a CPU in the first partition that tries to join the 
(non-existent) system runs the master console program, as 
discussed below. 

Next, in step 504, the master console program probes the 
hardware and creates the configuration tree in step 506 as 
discussed above. If there is more than one partition in the 
APMP system on power up, each partition is initialized and 
its console program is started (step 508). 

Finally, an operating system instance is booted in at least 
one of the partitions as indicated in step 510. The first 
operating system instance to boot creates an APMP database 
and fills in the entries as described below. APMP databases 
store information relating to the state of active operating 
system instances in the system. The routine then finishes in 
step 512. It should be noted that an instance is not required 
to participate in an APMP system. The instance can choose 
not to participate or to participate at a time that occurs well 
after boot. Those instances which do participate form a 
"sharing set." The first instance which decides to join a 
sharing set must create it. There can be multiple sharing sets 
operating on a single APMP system and each sharing set has 
its own APMP database. 

Deciding to Create a New APMP System or to Join an 
Existing APMP System 

An operating system instance running on a platform 
which is also running the APMP computer system does not 
necessarily have to be a member of the APMP computer 
system. The instance can attempt to become a member of the 
APMP system at any time after booting. This may occur 
either automatically at boot, or after an operator-command 
explicitly initiates joining. After the operating system is 
loaded at boot time, the operating system initialization 
routine is invoked and examines a stored parameter to see 
whether it specifies immediate joining and, if so, the system 
executes a joining routine which is part of the APMP 
computer system. An operator command would result in an 
execution of the same routine. 
APMP Database

An important data structure supporting the inventive 
software allocation of resources is the APMP database which 
keeps track of operating system instances which are mem- 
bers of a sharing set. The first operating system instance 
attempting to set up the APMP computer system initializes 
an APMP database, thus creating, or instantiating, the inven- 
tive software resource allocations for the initial sharing set. 
Later instances wishing to become part of the sharing set 
join by registering in the APMP database associated with 
that sharing set. The APMP database is a shared data 
structure containing the centralized information required for 
the management of shared resources of the sharing set. An 
APMP database is also initialized when the APMP computer 
system is re-formed in response to an unrecoverable error.

More specifically, each APMP database is a three-part
structure. The first part is a fixed-size header portion including
basic synchronization structures for creation of the
APMP computer system, address-mapping information for
the database and offsets to the service-specific segments that
make up the second portion. The second portion is an array
of data blocks with one block assigned to each potential
instance. The data blocks are called "node blocks." The third
portion is divided into segments used by each of the computer
system sub-facilities. Each sub-facility is responsible
for the content of, and synchronizing access to, its own
segment.

The initial, header portion of an APMP database is the first 
part of the APMP database mapped by a joining operating 
system instance. Portions of the header are accessed before 
the instance has joined the sharing set, and, in fact, before 
the instance knows that the APMP computer system exists. 

The header section contains:

1. a membership and creation synchronization quadword

2. a computer system software version

3. state information, creation time, incarnation count, etc.

4. a pointer (offset) to a membership mask

5. crashing instance, crash acknowledge bits, etc.

6. validation masks, including a bit for each service

7. memory mapping information (page frame number
information) for the entire APMP database

8. offset/length pairs describing each of the service segments
(lengths in bytes rounded to pages and offsets full pages)
including

   shared memory services

   cpu communications services

   membership services (if required)

   locking services

The array of node blocks is indexed by a system partition
id (one per instance possible on the current platform) and
each block contains:

instance software version

interrupt reason mask

instance state

instance incarnation

instance heartbeat

instance membership timestamp

little brother instance id and inactive-time; big brother
instance id

instance validation done bit.
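
For orientation, the node block fields listed above could be modeled as a simple record. The Python sketch below is an assumption for illustration only: the field names, types, default values and the helper make_node_blocks are hypothetical, and no field widths or encodings are implied by the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeBlock:
    """Hypothetical model of one node block; all encodings are assumed."""
    software_version: int = 0
    interrupt_reason_mask: int = 0
    state: str = "unused"              # assumed placeholder state name
    incarnation: int = 0
    heartbeat: int = 0
    membership_timestamp: int = 0
    little_brother_id: int = -1        # -1 is assumed to mean "none"
    little_brother_inactive_time: int = 0
    big_brother_id: int = -1
    validation_done: bool = False


def make_node_blocks(max_partitions: int) -> List[NodeBlock]:
    """One block per instance possible on the platform, indexed by partition id."""
    return [NodeBlock() for _ in range(max_partitions)]
```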

An APMP database is stored in shared memory. The initial
fixed portion of N physically contiguous pages occupies the
first N pages of one of two memory ranges allocated by the
first instance to join during initial partitioning of the
hardware. The instance directs the console to store the starting
physical addresses of these ranges in the configuration tree.
The purpose of allocating two ranges is to permit failover in
case of hardware memory failure. Memory management is
responsible for mapping the physical memory into virtual
address space for the APMP database.

The detailed actions taken by an operating system
instance are illustrated in FIG. 6. More specifically, when an
operating system instance wishes to become a member of a
sharing set, it must be prepared to create the APMP computer
system if it is the first instance attempting to "join" a
non-existent system. In order for the instance to determine
whether an APMP system already exists, the instance must
be able to examine the state of shared memory as described
above. Further, it must be able to synchronize with other
instances which may be attempting to join the APMP system
and the sharing set at the same time to prevent conflicting
creation attempts. The master console creates the configuration
tree as discussed above. Subsequently, a region of
memory is initialized by the first, or primary, operating
system instance to boot, and this memory region can be used
for an APMP database.

Mapping the APMP Database Header

The goal of the initial actions taken by all operating
system instances is to map the header portion of the APMP
database and initialize primitive inter-instance interrupt
handling to lay the groundwork for a create or join decision. The
routine used is illustrated in FIG. 6 which begins in step 600.
The first action taken by each instance (step 602) is to
engage memory management to map the initial segment of
the APMP database as described above. At this time, the
array of node blocks in the second database section is also
mapped. Memory management maps the initial and second
segments of the APMP database into the primary operating
system address space and returns the start address and
length. The instance then informs the console to store the
location and size of the segments in the configuration tree.

Next, in step 604, the initial virtual address of the APMP
database is used to allow the initialization routine to zero
interrupt reason masks in the node block assigned to the
current instance.

A zero initial value is then stored to the heartbeat field for
the instance in the node block, and other node block fields.
In some cases, the instance attempting to create a new APMP
computer system was previously a member of an APMP
system and did not withdraw from the APMP system. If this
instance is rebooting before the other instances have
removed it, then its bit will still be "on" in the system
membership mask. Other unusual or error cases can also
lead to "garbage" being stored in the system membership
mask.

Next, in step 608, the virtual address (VA) of the APMP
database is stored in a private cell which is examined by an
inter-processor interrupt handler. The handler examines this
cell to determine whether to test the per-instance interrupt
reason mask in the APMP database header for work to do.
If this cell is zero, the APMP database is not mapped and
nothing further is done by the handler. As previously
discussed, the entire APMP database, including this mask, is
initialized so that the handler does nothing before the
address is stored. In addition, a clock interrupt handler can
examine the same private cell to determine whether to
increment the instance-specific heartbeat field for this
instance in the appropriate node block. If the private cell is
zero, the interrupt handler does not increment the heartbeat
field.
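
The gating role of the private cell can be sketched as follows. This Python model is a simplified assumption: the class InstanceHandlers is hypothetical, the node block is a plain dictionary, the interrupt reason mask is kept in the node block for simplicity, and None stands in for a zero address in the private cell.

```python
class InstanceHandlers:
    """Hypothetical per-instance handlers gated by the 'private cell'."""

    def __init__(self):
        self.db_cell = None   # None models a zero (unmapped) database address

    def map_database(self, node_block: dict) -> None:
        # Zero the mask and heartbeat first; store the address last so the
        # handlers do nothing until initialization is complete.
        node_block["interrupt_reason_mask"] = 0
        node_block["heartbeat"] = 0
        self.db_cell = node_block

    def clock_interrupt(self) -> None:
        if self.db_cell is None:
            return            # database not mapped: no heartbeat increment
        self.db_cell["heartbeat"] += 1

    def ip_interrupt(self) -> int:
        if self.db_cell is None:
            return 0          # database not mapped: nothing to do
        reasons = self.db_cell["interrupt_reason_mask"]
        self.db_cell["interrupt_reason_mask"] = 0
        return reasons        # bits describing the work requested
```

Storing the address only after the fields are zeroed mirrors the ordering in steps 604 through 608, so that a handler running early never acts on stale data.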

At this point, the routine is finished (step 610) and the
APMP database header is accessible and the joining instance
is able to examine the header and decide whether the APMP
computer system does not exist and, therefore, the instance
must create it, or whether the instance will be joining an
already-existing APMP system.

Once the APMP header is mapped, the header is examined
to determine whether an APMP computer system is up and
functioning, and, if not, whether the current instance should
initialize the APMP database and create the APMP computer
system. The problem of joining an existing APMP system
becomes more difficult, for example, if the APMP computer
system was created at one time, but now has no members, or
if the APMP system is being reformed after an error. In this
case, the state of the APMP database memory is not known
in advance, and a simple memory test is not sufficient. An
instance that is attempting to join a possibly existing APMP
system must be able to determine whether an APMP system
exists or not and, if it does not, the instance must be able to
create a new APMP system without interference from other
instances. This interference could arise from threads running
either on the same instance or on another instance.

In order to prevent such interference, the create/join 
decision is made by first locking the APMP database and 
then examining the APMP header to determine whether 
there is a functioning APMP computer system. If there is a 
properly functioning APMP system, then the instance joins 
the system and releases the lock on the APMP database. 
Alternatively, if there is no APMP system, or if there is
an APMP system, but it is non-functioning, then the instance 
creates a new APMP system, with itself as a member and 
releases the lock on the APMP database. 

If there appears to be an APMP system in transition, then 
the instance waits until the APMP system is again opera- 
tional or dead, and then proceeds as above. If a system 
cannot be created, then joining fails. 
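
The create/join decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the described implementation: the function create_or_join is hypothetical, the APMP database is a dictionary, a threading.Lock stands in for the APMP database lock, and any state other than "operational" or a transitional state is treated as "no functioning system".

```python
import threading


def create_or_join(db: dict, lock: threading.Lock, instance_id: int) -> str:
    """Decide under the database lock whether to create, join, or wait."""
    with lock:
        state = db.get("state")
        if state == "operational":
            db["members"].add(instance_id)       # join the existing system
            return "joined"
        if state in ("initializing", "instance joining"):
            return "wait"                        # system in transition; retry
        # No system, or a non-functioning one: create a new system.
        db["state"] = "operational"
        db["members"] = {instance_id}
        db["incarnation"] = db.get("incarnation", 0) + 1
        return "created"
```

Because the header is examined only while the lock is held, concurrent callers cannot both conclude that no system exists; exactly one of them creates the system and the others join it.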
Creating a new APMP Computer System 

Assuming that a new APMP system must be created, the 
creator instance is responsible for allocating the rest of the 
APMP database, initializing the header and invoking system 
services. Assuming the APMP database is locked as 
described above, the following steps are taken by the creator 
instance to initialize the APMP system (these steps are 
shown in FIGS. 7A and 7B): 

Step 702 the creator instance sets the APMP system state and
its node block state to "initializing."

Step 704 the creator instance calls a size routine for each
system service with the address of its length field in the
header.

Step 706 the resulting length fields are summed and the
creator instance calls memory management to allocate
space for the entire APMP database by creating a new
mapping and deleting the old mapping.

Step 708 the creator instance fills in the offsets to the
beginnings of each system service segment.

Step 710 the initialization routine for each service is called
with the virtual addresses of the APMP database, the
service segment and the segment length.

Step 712 the creator instance initializes a membership mask
to make itself the sole member and increments an
incarnation count. It then sets creation time, software version,
and other creation parameters.

Step 714 the instance then sets itself as its own big and little
brother (for heartbeat monitoring purposes as described
below).

Step 716 the instance then fills in its instance state as
"member" and the APMP system state as "operational."

Step 718 finally, the instance releases the APMP database
lock.

The routine then ends in step 720. 
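
Steps 704 through 708 amount to simple layout arithmetic, which can be sketched as follows. The Python below is an illustration under assumptions: the 8192-byte page size, the function names, and the example service sizes are hypothetical, and the fixed portion is taken to be the header together with the node block array.

```python
PAGE_SIZE = 8192  # assumed page size, for illustration only


def round_up_to_page(nbytes: int) -> int:
    """Round a byte count up to a whole number of pages."""
    return -(-nbytes // PAGE_SIZE) * PAGE_SIZE


def layout_segments(fixed_bytes: int, service_sizes: dict):
    """Return ({service: (offset, length)}, total_bytes) for the database.

    Lengths are rounded to pages and segments are packed back to back
    after the fixed portion, so every offset falls on a page boundary.
    """
    offset = round_up_to_page(fixed_bytes)
    table = {}
    for name, size in service_sizes.items():
        length = round_up_to_page(size)
        table[name] = (offset, length)
        offset += length
    return table, offset
```

With a fixed portion of two pages and hypothetical services requesting 100 and 8193 bytes, the segments begin at offsets 16384 and 24576 and the whole database occupies 40960 bytes.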
Joining an Existing APMP Computer System 

Assuming an instance has the APMP database locked, the 
following steps are taken by the instance to become a 
member of an existing APMP system (shown in FIGS. 8A 
and 8B): 

Step 802 the instance checks to make sure that its instance
name is unique. If another current member has the
instance's proposed name, joining is aborted.

Step 804 the instance sets the APMP system state and its
node block state to "instance joining."

Step 806 the instance calls a memory management routine to
map the variable portion of the APMP database into its
local address space.

Step 808 the instance calls system joining routines for each
system service with the virtual addresses of the APMP
database and its segment and its segment length.

Step 810 if all system service joining routines report success,
then the instance joining routine continues. If any system
service join routine fails, the instance joining process
must start over and possibly create a new APMP computer
system.

Step 812 assuming that success was achieved in step 810,
the instance adds itself to the system membership mask.

Step 814 the instance selects a big brother to monitor its
instance health as set forth below.

Step 816 the instance fills in its instance state as "member"
and sets a local membership flag.

Step 818 the instance releases the configuration database
lock.

The routine then ends in step 820. 

The loss of an instance, either through inactivity timeout 
or a crash, is detected by means of a "heartbeat" mechanism 
implemented in the APMP database. Instances will attempt 
to do minimal checking and cleanup and notify the rest of 
the APMP system during an instance crash. When this is not 
possible, system services will detect the disappearance of an 
instance via a software heartbeat mechanism. In particular, 
a "heartbeat" field is allocated in the APMP database for 
each active instance. This field is written to by the corre- 
sponding instance at time intervals that are less than a 
predetermined value, for example, every two milliseconds. 

Any instance may examine the heartbeat field of any other
instance to make a direct determination for some specific
purpose. An instance checks another instance by reading
that instance's heartbeat field twice, with the two reads
separated by two milliseconds. If the heartbeat is not
incremented between the two reads, the instance is considered
inactive (gone, halted at control-P, or hung at or above clock
interrupt priority level). If the instance remains inactive for
a predetermined time, then the instance is considered dead
or disinterested.
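
The two-read check can be sketched as a small Python function. This is an illustration only: the function is_active and its parameters are hypothetical, the heartbeat is read through a caller-supplied function, and the sleep function is injectable so that the sketch can be exercised without real delays.

```python
import time


def is_active(read_heartbeat, interval_s: float = 0.002, sleep=time.sleep) -> bool:
    """Read a heartbeat twice, interval_s apart; unchanged means inactive."""
    first = read_heartbeat()
    sleep(interval_s)
    return read_heartbeat() != first
```

A caller would apply this check repeatedly and only judge the instance dead or disinterested after it has remained inactive for the predetermined time.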

In addition, a special arrangement is used to monitor all 
instances because it is not feasible for every instance to 
watch every other instance, especially as the APMP system 
becomes large. This arrangement uses a "big brother-little
brother" scheme. More particularly, when an instance joins
the APMP system, before releasing the lock on the APMP 
database, it picks one of the current members to be its big 
brother and watch over the joining instance. The joining 
instance first assumes big brother duties for its chosen big 
brother's current little brother, and then assigns itself as the 
new little brother of the chosen instance. Conversely, when 
an instance exits the APMP computer system while still in 
operation so that it is able to perform exit processing, and 
while it is holding the lock on the APMP database, it assigns 
its big brother duties to its current big brother before it stops 
incrementing its heartbeat. 

Every clock tick, after incrementing its own heartbeat, 
each instance reads its little brother's heartbeat and com- 
pares it to the value read at the last clock tick. If the new 
value is greater, or the little brother's ID has changed, the 
little brother is considered active. However, if the little 
brother ID and its heartbeat value are the same, the little 
brother is considered inactive, and the current instance 
begins watching its little brother's little brother as well. This 
accumulation of responsibility continues to a predetermined 
maximum and ensures that the failure of one instance does
not result in missing the failure of its little brother. If the 
little brother begins incrementing its heartbeat again, all 
additional responsibilities are dropped. 
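
The big brother-little brother bookkeeping described above can be sketched as a ring. The Python below is a simplified model under assumptions: the class Ring and its methods are hypothetical, heartbeats are supplied as dictionaries of values read at the previous and current clock ticks, the default maximum of three watched instances is arbitrary, and the case in which a little brother's ID changes between ticks is not modeled.

```python
class Ring:
    """Hypothetical model: little[x] is watched by x; big[x] watches x."""

    def __init__(self, creator):
        # The creator is its own big and little brother.
        self.little = {creator: creator}
        self.big = {creator: creator}

    def join(self, new, chosen) -> None:
        # The joiner takes over the chosen big brother's current little
        # brother, then becomes the chosen instance's new little brother.
        old_little = self.little[chosen]
        self.little[new] = old_little
        self.big[old_little] = new
        self.little[chosen] = new
        self.big[new] = chosen

    def leave(self, member) -> None:
        # An exiting instance hands its duties to its own big brother.
        big = self.big.pop(member)
        little = self.little.pop(member)
        if big != member:
            self.little[big] = little
            self.big[little] = big

    def watched_by(self, me, prev: dict, now: dict, max_watch: int = 3) -> list:
        # Follow stalled little brothers until an active one is found,
        # the ring wraps around, or the maximum is reached.
        watched = []
        target = self.little[me]
        while target != me and len(watched) < max_watch:
            watched.append(target)
            if now[target] > prev[target]:
                break
            target = self.little[target]
        return watched
```

When every heartbeat advances, each instance watches only its immediate little brother; a stalled little brother causes its watcher to extend the list by one instance, and the list shrinks again once the heartbeat resumes.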

If a member instance is judged dead, or disinterested, and
it has not notified the APMP computer system of its intent
to shut down or crash, the instance is removed from the
APMP system. This may be done, for example, by setting
the "bugcheck" bit in the instance primitive interrupt mask
and sending an IP interrupt to all CPUs of the instance. As
a rule, shared memory may only be accessed below the
hardware priority of the IP interrupt. This ensures that if the
CPUs in the instance should attempt to execute at a priority
below that of the IP interrupt, the IP interrupt will occur first
and thus the CPU will see the "bugcheck" bit before any
lower priority threads can execute. This ensures the operating
system instance will crash and not touch shared
resources such as memory which may have been reallocated
for other purposes when the instances were judged dead. As
an additional or alternative mechanism, a console callback
(should one exist) can be invoked to remove the instance. In
addition, in accordance with a preferred embodiment, whenever
an instance disappears or drops out of the APMP
computer system without warning, the remaining instances
perform some sanity checks to determine whether they can
continue. These checks include verifying that all pages in the
APMP database are still accessible, i.e. that there was not a
memory failure.

Assignment of Resources After Joining 

A CPU can have at most one owner partition at any given 
time in the power-up life of an APMP system. However, the 
reflection of that ownership and the entity responsible for 
controlling it can change as a result of configuration and 
state transitions undergone by the resource itself, the parti- 
tion it resides within, and the instance running in that 
partition. 

CPU ownership is indicated in a number of ways, in a 
number of structures dictated by the entity that is managing 
the resource at the time. In the most basic case, the CPU can 
be in an unassigned state, available to all partitions that 
reside in the same sharing set as the CPU. Eventually that 
CPU is assigned to a specific partition, which may or may 
not be running an operating system instance. In either case, 
the partition reflects its ownership to all other partitions 
through the configuration tree structure, and to all operating 
system instances that may run in that partition through the 
AVAILABLE bit in the HWRPB per-CPU flags field. 

If the owning partition has no operating system instance 
running on it, its console is responsible for responding to, 
and initiating, transition events on the resources within it. 
The console decides if the resource is in a state that allows 
it to migrate to another partition or to revert back to the 
unassigned state. 

If, however, there is an instance currently running in the 
partition, the console relinquishes responsibility for initiating 
resource transitions and is responsible for notifying the 
running primary of the instance when a configuration change 
has taken place. It is still the facilitator of the underlying 
hardware transition, but control of resource transitions is 
elevated one level up to the operating system instance. The 
transfer of responsibility takes place when the primary CPU 
executes its first instruction outside of console mode in a 
system boot. 

Operating system instances can maintain ownership state 
information in any number of ways that promote the most 
efficient usage of the information internally. For example, a 
hierarchy of state bit vectors can be used which reflect the 
instance-specific information both internally and globally 
(to other members sharing an APMP database). 
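
One way such a hierarchy of state bit vectors might look is sketched below. The description does not specify the levels, so the level names (`configured`, `available`, `active`) and the rule that each level is a subset of the one before it are assumptions made purely for illustration, as are the class and method names.

```python
# Hypothetical levels, widest first; each is assumed to be a subset of the
# level before it. One bit per CPU ID.
LEVELS = ("configured", "available", "active")


class CpuStateVectors:
    """Illustrative per-instance CPU state kept as integer bit vectors."""

    def __init__(self):
        self.bits = {level: 0 for level in LEVELS}

    def raise_to(self, cpu_id, level):
        """Set the CPU's bit in `level` and in every wider level."""
        for name in LEVELS[: LEVELS.index(level) + 1]:
            self.bits[name] |= 1 << cpu_id

    def lower_from(self, cpu_id, level):
        """Clear the CPU's bit in `level` and in every narrower level."""
        for name in LEVELS[LEVELS.index(level):]:
            self.bits[name] &= ~(1 << cpu_id)

    def cpus(self, level):
        """List the CPU IDs whose bit is set in `level`."""
        vector = self.bits[level]
        return [i for i in range(vector.bit_length()) if (vector >> i) & 1]

    def snapshot(self):
        """Copy of the vectors, as might be published for other members to read."""
        return dict(self.bits)
```

Keeping the vectors as plain integers makes the global copy cheap to publish, which fits the read-only role the shared view plays for other instances.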

The internal representations are strictly for the use of the 
instance. They are built up at boot time from the underlying 
configuration tree and HWRPB information, but are maintained 
as strict software constructs for the life of the operating 
system instance. They represent the software view of 
the partition resources available to the instance, and may, 
through software rule sets, further restrict the configuration 
to a subset of that indicated by the physical constructs. 
Nevertheless, all resources in the partition are owned and 
managed by the instance, using the console mechanisms to 
direct state transitions, until that operating system invocation 
is no longer a viable entity. That state is indicated by 
halting the primary CPU once again back into console mode 
with no possibility of returning without a reboot. 

Ownership of CPU resources never extends beyond the 
instance. The state information of each individual instance is 
duplicated in an APMP database for read-only decision-making 
purposes, but no other instance can force a state 
transition event for another's CPU resource. Each instance 
is responsible for understanding and controlling its own 
resource set; it may receive external requests for its 
resources, but only it can make the decision to allow the 
resources to be transferred. 

When each such CPU becomes operational, it does not set 
its AVAILABLE bit in the per-CPU flags. When the AVAILABLE 
bit is not set, no instance will attempt to start, nor 
expect the CPU to join in SMP operation. Instead, the CPU, 
in console mode, polls the owner field in the configuration 
tree waiting for a valid partition to be assigned. Once a valid 
partition is assigned as the owner by the primary console, the 
CPU will begin operation in that partition. 

During runtime, the current_owner field reflects the partition 
where a CPU is executing. The AVAILABLE bit in the 
per-CPU flags field in the HWRPB remains the ultimate 
indicator of whether a CPU is actually available, or 
executing, for SMP operation with an operating system 
instance, and has the same meaning as in conventional SMP 
systems. 

It should be noted that an instance need not be a member 
of a sharing set to participate in many of the reconfiguration 
features of an APMP computer system. An instance can 
transfer its resources to another instance in the APMP 
system so that an instance which is not a part of a sharing set 
can transfer a resource to an instance which is part of the 
sharing set. Similarly, the instance which is not a part of the 
sharing set can receive a resource from an instance which is 
part of the sharing set. 

Runtime Migration of Resources 

With the present invention, CPUs may be shared in a 
serial fashion by multiple partitions. Any CPU in the computer 
system can be moved from one partition to another, 
provided it is not a primary CPU in the partition where it is 
residing at the time, and is not bound by system constraints, 
such as distributed interrupt handling. The policy on when 
and where a CPU may migrate is strictly up to the operating 
system code which the CPU is executing. In the preferred 
embodiment, CPUs migrate by executing a "PAL 
MIGRATE" instruction. 

The PAL MIGRATE instruction invokes a set of steps 
which causes a CPU to be moved between instances. This 
method of migration may be used with other activities that 
may require CPU migration and, in general, involves a 
context switch between multiple HWRPBs. When a CPU 
migrates away from a particular instance, its context is 
stored in the HWRPB associated with the instance on which 
the CPU was running. That way, if the CPU migrates back 
to an instance where it was previously in operation, the 
context may be restored to allow the CPU to resume 
execution quickly. The steps in a PAL migration are depicted 
in FIGS. 9A-9B. 

Execution of the PAL MIGRATE instruction by a CPU 
causes the migration routine to start, as shown in step 900. 
The current hardware state of the CPU is saved in step 902, 
after which the CPU locates the destination partition, and 
determines whether it is a valid destination (step 904). If the 
validation process fails, the original hardware state is 
restored in step 905, and the CPU resumes operation within 
the original partition. If the CPU successfully validates the 
destination partition, the current_owner field of the CPU 
node in the configuration tree is cleared in step 906, and its 
available bit is cleared in the per-CPU flags (step 908). The 
CPU's caches are then cleared in step 910. 

Any platform specific state for the CPU is initialized in 
step 912 (FIG. 9B), and the available bit for the CPU is 
cleared in the per-CPU flags in step 914. The current_owner 
field is then set in the CPU node of the configuration tree 
(step 916) to reflect the ID of the partition to which the CPU 
has migrated. The CPU is then provided with a hardware 
context (step 918). If a previous hardware state exists for the 
CPU (i.e. if it has operated previously in that partition), that 
context is restored. If there is no previous hardware state 
with that partition (i.e. the CPU has never executed on the 
partition), or if the previous hardware state is no longer 
valid, the state of the CPU is initialized. Finally, execution 
of the CPU is resumed in step 920. The execution continues 
at the instruction following the last migration instruction 
executed by the CPU in that partition or, if being initialized, 
it starts in the console initialization routine as a secondary 
processor. As shown in FIG. 9B, the process ends after 
execution is resumed. 

Each time a processor migrates, the console at the destination 
partition must accommodate the newly-migrated 
CPU. FIG. 10 illustrates the steps taken by the console at a 
destination partition to complete the migration. The routine 
begins in step 1000 and proceeds to step 1002, where the 
console places a STARTREQ message into the migrated 
CPU's TX buffer in the per-CPU slot, and sets its TXRDY 
bit in the HWRPB. Next the console signals the primary 
CPU in the partition by means of an interrupt as set forth in 
step 1004. The migrated CPU polls the RXRDY bit in the 
HWRPB waiting for a command, such as START, to begin 
operation as set forth in step 1006. The routine then finishes 
in step 1008. 

When an operating system instance crashes, the CPUs 
that are active in the partition will continue to be a part of 
the same instance at reboot. The CPUs do not migrate 
automatically to their nominal "owners". Nor do CPUs 
which are "owned" by a partition migrate back to an 
operating system instance which is crashing or rebooting. 
The available bit in the per-CPU flags in the HWRPB 
indicates the current ownership. This is also reflected in the 
current_owner field of the CPU node in the configuration 
tree. 

The operating system may implement an automatic 
migration of secondary CPUs as part of its crash logic. That 
is, when a secondary CPU reaches the end of its crash logic, 
and would typically enter a waiting state, the operating 
system can implement a policy to cause the CPUs to instead 
migrate to a pre-defined partition. This would allow implementation 
of directed warm failover systems where the 
CPUs immediately are available at the warm backup partition 
when the primary application partition fails. 

A software implementation of the above-described 
embodiment may comprise a series of computer instructions 
either fixed on a tangible medium, such as a computer 
readable media, e.g. a diskette, a CD-ROM, a ROM 
memory, or a fixed disk, or transmissible to a computer 
system, via a modem or other interface device over a 
medium. The medium can be either a tangible medium, 
including but not limited to optical or analog communications 
lines, or may be implemented with wireless techniques, 
including but not limited to microwave, infrared or other 
transmission techniques. It may also be the Internet. The 
series of computer instructions embodies all or part of the 
functionality previously described herein with respect to the 
invention. Those skilled in the art will appreciate that such 
computer instructions can be written in a number of pro- 
gramming languages for use with many computer architec- 
tures or operating systems. Further, such instructions may be 
stored using any memory technology, present or future, 
including, but not limited to, semiconductor, magnetic, 
optical or other memory devices, or transmitted using any 
communications technology, present or future, including but 
not limited to optical, infrared, microwave, or other transmission 
technologies. It is contemplated that such a computer 
program product may be distributed as a removable 
media with accompanying printed or electronic 
documentation, e.g., shrink wrapped software, pre-loaded 
with a computer system, e.g., on system ROM or fixed disk, 
or distributed from a server or electronic bulletin board over 
a network, e.g., the Internet or World Wide Web. 
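
As one illustration of how such instructions might be organized, the migration sequence of FIGS. 9A-9B described earlier can be modeled in a few lines. This is a sketch under stated assumptions, not the PALcode itself: the `Partition` and `Cpu` classes are hypothetical stand-ins for the HWRPB per-CPU slots and the configuration tree CPU node, the validity test in step 904 is reduced to "the destination exists and differs from the source", steps 908 and 914 are read as clearing the AVAILABLE bit in the source and destination per-CPU flags respectively, and the cache flush and platform specific initialization (steps 910 and 912) are not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Partition:
    """Hypothetical stand-in for a partition and its HWRPB per-CPU slots."""
    pid: int
    available: dict = field(default_factory=dict)      # cpu_id -> AVAILABLE bit
    saved_context: dict = field(default_factory=dict)  # cpu_id -> saved state


@dataclass
class Cpu:
    """Hypothetical stand-in for a CPU and its configuration tree node."""
    cpu_id: int
    current_owner: Optional[int]
    context: dict


def pal_migrate(cpu, partitions, dest_id):
    """Follow steps 902-920; return True if the CPU changed partition."""
    source = partitions[cpu.current_owner]
    # Step 902: save the current hardware state with the source partition.
    source.saved_context[cpu.cpu_id] = dict(cpu.context)
    # Step 904: locate the destination and check it (assumed validity rule).
    dest = partitions.get(dest_id)
    if dest is None or dest is source:
        # Step 905: restore the original state; stay in the original partition.
        cpu.context = dict(source.saved_context[cpu.cpu_id])
        return False
    # Steps 906 and 908: clear the owner field and the AVAILABLE bit.
    cpu.current_owner = None
    source.available[cpu.cpu_id] = False
    # Step 914: AVAILABLE is clear in the destination flags as well.
    dest.available[cpu.cpu_id] = False
    # Step 916: record the new owner in the configuration tree.
    cpu.current_owner = dest.pid
    # Step 918: restore a previous context if one exists, else initialize.
    previous = dest.saved_context.get(cpu.cpu_id)
    if previous is not None:
        cpu.context = dict(previous)
    else:
        cpu.context = {"pc": "console_init"}  # placeholder initial state
    # Step 920: execution resumes in the destination partition.
    return True
```

Because the AVAILABLE bit is left clear at the destination, the sketch stops where the console routine of FIG. 10 would take over and announce the CPU to the primary of the destination partition.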

Although an exemplary embodiment of the invention has 
been disclosed, it will be apparent to those skilled in the art 
that various changes and modifications can be made which 
will achieve some of the advantages of the invention without 
departing from the spirit and scope of the invention. For 
example, it will be obvious to those reasonably skilled in the 
art that, although the description was directed to a particular 
hardware system and operating system, other hardware and 
operating system software could be used in the same manner 
as that described. Other aspects, such as the specific instruc- 
tions utilized to achieve a particular function, as well as 
other modifications to the inventive concept are intended to 
be covered by the appended claims. 

What is claimed is: 

1. A computer system having a plurality of processors, 
memory and I/O circuitry, the computer system comprising: 

an interconnection mechanism for electrically intercon- 
necting the processors, memory and I/O circuitry so 
that each processor has electrical access to all of the 
memory and at least some of the I/O circuitry; 

a software mechanism for dividing the processors, 
memory and I/O circuitry into a plurality of partitions, 
each partition including at least one processor, some 
memory and some I/O circuitry; 

an operating system instance running in each partition; 
and 

a processor migration apparatus that reassigns a first 
processor from a first partition to a second partition, 
wherein said migration apparatus stores a processing 
context of the processor relative to the first partition 
prior to the reassignment. 

2. A computer system according to claim 1 wherein the 
migration apparatus reassigns the first processor during 
system operation without a reboot of the entire system. 

3. A computer system according to claim 1 wherein the 
migration apparatus initiates an indication to an operating 
system instance in the second partition that the first proces- 
sor is available for use. 

4. A computer system according to claim 1 wherein the 
plurality of processors is divided into groups and wherein 
each group comprises a console program via which an 
operator can interact with the processors in the group. 

5. A computer system according to claim 4 wherein the 
migrating apparatus completes the reassignment of the 
migrating processor without the intervention of any of the 
console programs. 



6. A computer system according to claim 1 wherein the 
software mechanism is such that each partition includes at 
least one CPU node that corresponds to a particular proces- 
sor that is associated with a memory location in which is 
stored a value identifying the partition with which the 
processor is associated. 

7. A computer system according to claim 1 further com- 
prising a plurality of hardware flags associated with each 
partition, the hardware flags for a particular partition includ- 
ing at least one flag indicating the operational status of a 
particular processor executing on that partition. 

8. A computer system according to claim 7 wherein the 
hardware flags include an availability flag indicating 
whether said particular processor is available to join sym- 
metric multiprocessing (SMP) on that partition. 

9. A computer system according to claim 8 wherein the 
availability flag is a single bit. 

10. A computer system according to claim 7 wherein 
each set of hardware flags includes an ownership flag, and, 
prior to the reassignment of the first processor, a first one of 
the ownership flags for the first partition indicates that the 
first processor is under the control of an instance running on 
the first partition, while, after the reassignment of the first 
processor, the first ownership flag indicates that the first 
processor is no longer under the control of the instance 
running on the first partition. 

11. A computer system according to claim 10 wherein, 
after the reassignment, a first ownership flag for the second 
partition indicates that the processor is under the control of 
an instance running on the second partition. 

12. A computer system according to claim 1 wherein, after 
the reassignment of the first processor, the first processor 
loads any processing context that it may have stored during 
a previous execution with the second partition. 

13. A computer system according to claim 1 wherein the 
migration apparatus initiates the execution of a migration 
instruction by the first processor. 

14. A computer system having a plurality of processors, 
memory and I/O circuitry, the computer system comprising: 

an interconnection mechanism for electrically intercon- 
necting the processors, memory and I/O circuitry so 
that each processor has electrical access to all of the 
memory and at least some of the I/O circuitry, the 
plurality of processors being physically divided into 
groups wherein each group comprises a console pro- 
gram which controls the processors in the group; 

a software mechanism for dividing the processors, 
memory and I/O circuitry into a plurality of partitions, 
each partition including at least one processor, some 
memory and some I/O circuitry, wherein a plurality of 
hardware flags are associated with each partition, the 
hardware flags for a particular partition including at 
least one flag indicating the operational status of a 
particular processor executing on that partition; 

an operating system instance running in each partition; 
and 

a processor migration apparatus that reassigns a first 
processor from a first partition to a second partition 
wherein, prior to the reassignment, said migration 
apparatus causes the first processor to store its current 
processing context with the first partition and, after the 
reassignment, causes the first processor to load any 
processing context which it may have stored from a 
previous execution within the second partition. 

15. A method of operating a multiple processor computing 
system having a plurality of processors, memory and I/O 
circuitry, the method providing: 

electrically interconnecting the processors, memory and 
I/O circuitry so that each processor has electrical access 
to all of the memory and at least some of the I/O 
circuitry; 

using a software mechanism to divide the processors, 
memory and I/O circuitry into a plurality of partitions, 
each partition including at least one processor, some 
memory and some I/O circuitry; 

running an operating system instance in each partition; 
and 

reassigning a first processor from a first partition to a 
second partition, wherein said reassigning comprises 
causing, prior to the reassignment, the first processor to 
store a processing context relative to the first partition. 

16. A method according to claim 15 wherein reassigning 
the first processor comprises reassigning the first processor 
without rebooting the entire system. 

17. A method according to claim 15 further comprising 
maintaining configuration information indicating which of 
the plurality of processors is assigned to each partition. 

18. A method according to claim 17 wherein the reas- 
signing further comprises modifying the configuration infor- 
mation relative to the assignment of the first processor. 

19. A method according to claim 15 further comprising 
communicating with an operating system instance in the 
second partition to instruct said instance to begin that the 
first processor is available for use. 

20. A method according to claim 15 wherein the reassigning 
further comprises causing the first processor, after 
reassignment, to load any processing context which it may 
have stored from a previous execution with the second 
partition. 